CNDB-13745: Fix checksum for IncrementalChecksumSequentialWriter edge case #1726
base: main-5.0
Conversation
- use `cassandra.remote_storage_handler_factory` to replace `cassandra.remote_storage_handler`
- add `TableMetadata` to `StorageProvider#createDataDirectories` signature
- use `enableAutoCompaction` to replace `StorageHandler#isRemote`
- emit SSTableAddingNotification in LifecycleTransaction#checkpoint before new sstables are made visible. Note that early open is not supported because it opens sstables before `trackNewWritten`
Make `RowFilter` non-iterable because it’s confusing in a tree of expressions. Also:
* Change the meaning of `RowFilter#expressions()` to traverse the entire tree of expressions.
* Add a `#withoutDisjunctions()` utility method for legacy indexes.
* Inspect, and update if needed, all the uses of RowFilter iteration to ensure that they are valid.
* Remove the `warnIfFilterIsATree` warning since it should be clear what each method does.
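As a rough illustration of why iterating the filter directly became confusing, the sketch below flattens a tree of expressions the way a tree-walking `expressions()` would have to. `FilterTreeSketch`, `Node`, `Operation` and `Expression` are invented stand-ins for this example, not the actual `RowFilter` API.

```java
import java.util.ArrayList;
import java.util.List;

// Invented stand-ins for illustration; not the RowFilter API.
final class FilterTreeSketch
{
    sealed interface Node permits Expression, Operation {}
    record Expression(String column, String op, Object value) implements Node {}
    record Operation(List<Node> children) implements Node {} // an AND/OR node

    // With disjunctions the filter is a tree, so collecting every expression
    // means walking the whole tree rather than iterating a single flat list.
    static List<Expression> expressions(Node root)
    {
        List<Expression> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node node, List<Expression> out)
    {
        if (node instanceof Expression e)
            out.add(e);
        else if (node instanceof Operation op)
            op.children().forEach(child -> collect(child, out));
    }
}
```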
Closes riptano/cndb#11932 The anti-join strategy for the inequality operator, `!=`, can be replaced with a union strategy of two semi-ranges, which makes the plan explicit and can be more efficient. This commit plans the inequality operator with the union of two semi-ranges, so no anti-join iterator is used for it. In the case of truncatable types such as BigDecimal and BigInteger, it implements a full index scan as a range index scan plan node without bounds. This PR also adds tests for `not contains key` to discard the usage of the union strategy by testing the false-negative case of an empty map.
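As a rough sketch of the union-of-semi-ranges idea described above, `x != v` becomes the union of `(-inf, v)` and `(v, +inf)`, while truncatable types fall back to an unbounded range scan. The `Plan`, `RangeScan` and `Union` types below are invented for illustration and are not the SAI planner API.

```java
import java.util.List;

// Invented plan-node types for illustration; not the SAI planner API.
final class InequalityPlanSketch
{
    sealed interface Plan permits RangeScan, Union {}
    // A null bound means unbounded on that side; bound exclusivity is omitted.
    record RangeScan(Object lower, Object upper) implements Plan {}
    record Union(List<Plan> branches) implements Plan {}

    // Plan "column != value" as the union of two semi-ranges instead of an
    // anti-join. Truncatable types (e.g. BigDecimal/BigInteger) store lossy
    // index keys, so they get an unbounded range scan and are re-filtered.
    static Plan planNotEquals(Object value, boolean truncatableType)
    {
        if (truncatableType)
            return new RangeScan(null, null);                   // full index scan
        return new Union(List.of(new RangeScan(null, value),    // (-inf, value)
                                 new RangeScan(value, null)));  // (value, +inf)
    }
}
```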
This performance regression was introduced by 9cabec5. I haven't been able to find the exact commit responsible for this regression, but essentially, we switched from `DirectReaders` to the Lucene `DirectReader` around 0774345 and 6bd00d1, and that led to a lot of unnecessary object creation. In my initial testing, this change appeared to improve numeric query throughput from `282.03it/s` to `333.15it/s` for 100k queries. The memory profile also showed far fewer objects created. Note: the above numbers might have been due to variability in my testing; it would be helpful to test in a more controlled environment. Either way, based on my understanding of the objects involved, this should generally produce better results because we'll have fewer objects. Finally, this change is especially helpful because the `sort then filter` logic has to initialize the iterators for search, so it helps in both execution paths.
This splits compactions that are to produce more than one output sstable into tasks that can execute in parallel. Such tasks share a transaction and have combined progress and observer. Because we cannot mark parts of an sstable as unneeded, the transaction is only applied when all tasks have succeeded. This also means that early open is not supported for such tasks. The parallelization also takes into account thread reservations, reducing the parallelism to the number of available threads for its level. The new functionality is turned on by default. Major compactions will apply the same mechanism to parallelize the operation. They will only split on pre-existing boundary points if they are also boundary points for the current UCS configuration. This is done to ensure that major compactions can re-shard data when the configuration is changed. If pre-existing boundaries match the current state, a major compaction will still be broken into multiple operations to reduce the space overhead of the operation. A sketch of the execution model follows this list. Also:
- Introduces a parallelism parameter to major compactions (`nodetool compact -j <threads>`, defaulting to half the compaction threads) to avoid stopping all other compaction for the duration.
- Changes SSTable expiration to be done in a separate `getNextBackgroundCompactions` round to improve the efficiency of expiration (the separate task can run quickly and remove the relevant sstables without waiting for a compaction to end).
- Applies the small-partition-count correction in `ShardManager.calculateCombinedDensity`.
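A simplified sketch of that execution model, using an invented `SharedTransaction` interface and plain `Runnable` shard tasks rather than the actual UCS classes: shard tasks run in parallel on a bounded pool, and the shared transaction is committed only once every task has succeeded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Invented names for illustration; not the UCS compaction classes.
final class ParallelCompactionSketch
{
    interface SharedTransaction { void commit(); void abort(); }

    // Shard tasks run in parallel, capped by the threads reserved for the
    // level; because parts of an sstable cannot be marked unneeded, the
    // shared transaction is applied only after every task has succeeded.
    static void runShards(List<Runnable> shardTasks, SharedTransaction txn, int reservedThreads)
        throws InterruptedException
    {
        int parallelism = Math.max(1, Math.min(shardTasks.size(), reservedThreads));
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try
        {
            List<Future<?>> futures = new ArrayList<>();
            for (Runnable task : shardTasks)
                futures.add(pool.submit(task));
            for (Future<?> f : futures)
                f.get();      // propagate the first failure
            txn.commit();     // apply only when all shard tasks succeeded
        }
        catch (ExecutionException e)
        {
            txn.abort();      // any failed shard aborts the whole operation
        }
        finally
        {
            pool.shutdown();
        }
    }
}
```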
…CS (#1464) - add configuration to skip mutating STATS after receiving ZCS sstable
Expose some methods needed for riptano/cndb#12128
Unbounded queue length at the native transport stage can cause large backlogs of requests that the system processes, even though clients may no longer expect a response. This PR implements a limited backport of CNDB-11070, introducing the notion of a native request timeout that can shed messages with excessive queue times at the NTR stage as well as the async read/write stages, if enabled. Cross-node message timeouts are also now respected earlier in the mutation verb handler. This is a fairly straightforward cherry-pick of #1393 targeting main instead of cc-main-migration-release.
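Conceptually, the shedding works roughly like the sketch below (the class and method names are invented, this is not the NTR stage code): a request records when it was enqueued, and a worker drops it without executing it if it has already waited longer than the native request timeout.

```java
import java.util.concurrent.TimeUnit;

// Invented names for illustration; not the native transport stage code.
final class QueueTimeShedder
{
    private final long timeoutNanos;

    QueueTimeShedder(long timeout, TimeUnit unit)
    {
        this.timeoutNanos = unit.toNanos(timeout);
    }

    // A worker calls this when it dequeues a request: if the request has
    // already waited longer than the native request timeout, it is shed
    // instead of executed, since the client has likely given up on a reply.
    boolean shouldShed(long enqueuedAtNanos)
    {
        return System.nanoTime() - enqueuedAtNanos > timeoutNanos;
    }
}
```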
This reverts commit c06c94c. It seems the removal of `Index#postProcessor` by CNDB-11762 broke some tests in CNDB's `MultiNodeBillingTest`. Unfortunately that patch [was merged before creating the PR bumping the CC version used by CNDB](#1422 (comment)). [The CNDB PR](riptano/cndb#12076) was created after that merging but it was superseded by other CC version bumps. So I'm adding this reversal so we can investigate how the removal of `Index#postProcessor` affects those tests.
This patch replaces null values of `deterministic`, `monotonic` and `monotonic_on` columns in `system_schema.functions` and `system_schema.aggregates` with negative defaults. These defaults will be addressed if/once DB-672 gets ported to CC.
There are two mechanisms for detecting that the cluster is in an upgrade state and for determining the minimum version. The two are slightly different, and neither is pluggable, which means that CNDB doesn't work properly with them. Those mechanisms are implemented in `Gossiper`. Although we do not use `Gossiper` in CNDB, there are classes like `ColumnFilter` which go to `Gossiper` to check the upgrade state. So far, using those mechanisms in CNDB was a bit unpredictable: some of them reported that the cluster is upgraded and on the current version, while others did not. This turned out to be a problem, especially for `ColumnFilter`, because when we upgrade DSE --> CC, CC assumes that the newest filter version should be used, which is not correctly deserialized and interpreted by DSE. The fix is not small, but it probably simplifies things a bit. First of all, the two mechanisms are merged into one. Moreover, we added pluggability so that we can provide an appropriate implementation in CNDB coordinators and writers, which is based on ETCD.
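The merged, pluggable mechanism can be imagined along these lines (the interface and method names below are assumptions, not the actual API introduced here): callers such as `ColumnFilter` consult a single provider rather than `Gossiper`, and CNDB can plug in an ETCD-backed implementation.

```java
// Assumed shape for illustration only; not the actual interface added here.
public interface ClusterVersionProvider
{
    // True while nodes with different versions are present in the cluster.
    boolean isUpgradeInProgress();

    // The minimum version across the cluster, used e.g. by ColumnFilter to
    // pick a serialization version that older nodes can still deserialize.
    String minimumClusterVersion();
}
```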
Part of riptano/cndb#12139 Moves the constant shard count outside the loop over shards to reduce confusion.
…with DurationSpec type and 'native_transport_timeout_in_ms' as convertible old name with Long type; add some tests.
…MemtableIndexTest and TrieMemtableIndexAllocationsHeapBuffersTest from main branch.
…strictions (#1449) Closes riptano/cndb#12139 This PR adds a test of row count of a SAI plan in the presence of restrictions. Currently it tests queries with inequality, equality and half-ranges on different SAI column types and with or without histograms.
…g VIntOutOfRangeException to the catch block of SSTableIdentityIterator.exhaust method.
…pactionProgress to return the operation type from the first entry in the shared progress; needed in cases where a CompactionTask type is changed after creation.
…opriate (#1469) Fixes riptano/cndb#12239 We found that `System.nanoTime` was using significant CPU, but because the timeout is high enough, we can accept the inaccuracy.
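For illustration of that trade-off (this is a hypothetical sketch, not the actual change in the commit): when the timeout being checked is much larger than the refresh interval, a periodically updated timestamp is accurate enough and avoids calling `System.nanoTime()` on every check.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the actual code changed by this commit.
final class CoarseNanoClock
{
    private static volatile long nowNanos = System.nanoTime();

    static
    {
        ScheduledExecutorService updater = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "coarse-nano-clock");
            t.setDaemon(true);
            return t;
        });
        // Refresh every 10ms: plenty of accuracy when the timeout being
        // checked is orders of magnitude larger than the refresh interval.
        updater.scheduleAtFixedRate(() -> nowNanos = System.nanoTime(), 10, 10, TimeUnit.MILLISECONDS);
    }

    // Callers read a cached timestamp instead of paying for System.nanoTime()
    // on every timeout check.
    static long nanoTime()
    {
        return nowNanos;
    }
}
```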
Fixes regression in jvector 3.0.4 when compacting PQVectors larger than 2GB
SimpleClientPerfTest has been failing in CI since the changes from CNDB-10759. This change updates the anonymous `Message.Codec<QueryMessage>` class in `SimpleClientPerfTest` to override the correct method, `public CompletableFuture<Response> maybeExecuteAsync` from `QueryMessage`, whose signature was changed as part of CNDB-10759.
…ing for async batchlog removal (#1485) The test asserts that the batchlog is removed immediately after the write completes, but removal of the batchlog is async and can be delayed, particularly in resource-limited environments like CI.
The test generates partition index accesses by reusing the same key, and if the key cache is enabled, the test will fail for bigtable profiles because the key will be in the key cache.
…by filtering queries (#1484) Queries creating fake index contexts each create their own context, which can then race on metric registration (as the metrics have the same patterns). This can cause a query to fail. These metrics are superfluous, so we can skip creating them entirely.
… index format version 'dx', 'cx', or older.
…ate.accumulatedDataSize; it only worked to fix SensorsWriteTest.testMultipleRowsMutationWithClusteringKey for SkipListMemtable and may not be necessary if the default memtable becomes TrieMemtable. Revisit SensorsWriteTest later if necessary.
Move static class TrieMemtable.Factory to TrieMemtableFactory class; Use suggested TriePartitionUpdate.unsharedHeapSize implementation; Use InMemoryTrie.shortLived in TrieToDotTest and TrieToMermaidTest; Add specific versions aa, ca, and da to RowIndexTest;
Add addMemoryUsageTo in SkipListMemtable and TrieMemtable; add TrieMemtable.switchOut.
### What is the issue
The `main-5.0` branch uses `TrieMemtable` by default and the `test/conf/cassandra.yaml` configures `ByteOrderedPartitioner`. In this environment, the `CQLSSTableWriterDaemonTest` tests `testSkipBuildingIndexesWithSAI` and `testWriteWithSAI` are failing.
### What does this PR fix and why was it fixed
This fix moves the SAI specific tests from `CQLSSTableWriterTest` to a new `CQLSSTableWriterSAITest` so that `Murmur3Partitioner` can be explicitly set for the SAI specific tests.
…ngWIthAggregateTest to use paging in bytes for CC. (#1702)
### What is the issue
With the rebase of CC on C* 5.0.2, there is a new test CoordinatorReadLatencyMetricTest and CoordinatorReadLatencyMetricTest.internalPagingWithAggregateTest is failing.
### What does this PR fix and why was it fixed
Uses paging in bytes instead of paging in rows in internalPagingWithAggregateTest.
Removes the unnecessary tiebreaker column, since the word counts give different scores for text with the same term count. One value is improved to remove the possibility of flakiness in the order. Adds comments with the term and word counts to help verify the expected results. A helper function to count them is added. Also adds a small test for the analyzer with lowercase.
(#1691) Executing a query with BM25 search and a condition on a partial SSTable results in an empty-iterator access error, and there was no test that stored data in segments. This PR implements BM25 search tests that split data into two tables. This reproduced the bug CNDB-13696 and demonstrates the current confusion about the BM25 ordering result, to be fixed by CNDB-13553. The PR adds a check for an empty iterator created for a PK belonging to another segment. This fixes the bug of trying to get the first element of an empty iterator.
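The guard amounts to something like the generic helper below (illustration only, not the SAI iterator code): check `hasNext()` before taking the first element, so an iterator that is empty because its PK belongs to another segment no longer triggers a `NoSuchElementException`.

```java
import java.util.Iterator;

// Generic illustration of the guard; not the SAI segment iterator code.
final class FirstElementGuard
{
    // Returns the first element, or null when the iterator is empty (for
    // example, because the PK belongs to another segment), instead of
    // calling next() blindly and hitting NoSuchElementException.
    static <T> T firstOrNull(Iterator<T> it)
    {
        return it.hasNext() ? it.next() : null;
    }
}
```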
…nTest (#1664)
### What is the issue
`PendingAntiCompactionTest.testRetriesTimeout` is flaky on `main`, failing with a timeout reported in the junit test task. riptano/cndb#13524
### What does this PR fix and why was it fixed
`PendingAntiCompactionTest.testRetriesTimeout`, and a couple of other tests, call `Future.get` without any timeout. This change adds a 30-second timeout to prevent an indefinite wait for the future, and the failure is logged in the failing test rather than in the junit test task.
### What is the issue
This is a follow-up test for riptano/cndb#13696.
### What does this PR fix and why was it fixed
The fix was already merged, but this confirms that without the fix, collections are affected by the NoSuchElementException bug.
Relates to riptano/cndb#13766 Brings in several improvements. CNDB PR: riptano/cndb#13789
### What is the issue
Didn't create an issue as this is just a new test.
### What does this PR fix and why was it fixed
We implicitly rely on this and are exposing this as a feature to our end users, so we want to confirm that the stop words are the expected stop words.
Fixes: riptano/cndb#13718 We had adopted this version of Lucene, datastax/lucene@5ea8bb4, in order to support our custom modifications of HNSW on top of Lucene. We now use https://github.com/datastax/jvector for vector search and no longer need a custom build. I propose we use 9.8.0 since that is the closest release to the one we have been using. CNDB test PR: riptano/cndb#13757
Because it's involved in writing vints. If you are interested, take a look at CASSANDRA-15215 where this was introduced.
Thank you for the reference @blambov. It looks like we don't even use the |
Interesting, ok I will take a look |
### What is the issue
Fixes: https://github.com/riptano/cndb/issues/13745
### What does this PR fix and why was it fixed
In the `IncrementalChecksumSequentialWriter`, we call `writeMostSignificantBytes`, and that has a minor "bug" in the implementation in the `BufferedDataOutputStreamPlus` class: when we call `super.writeMostSignificantBytes`, we accidentally double count the next bytes in the checksum. This is in part due to the hybrid solution. Instead, I propose that we remove the naive solution (which is likely slower) and instead just do the bit math that calls the appropriate `writeBytes`, `writeShort`, etc. methods.
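A minimal sketch of that bit-math approach, assuming the incremental-checksum writer intercepts the primitive write methods exactly once per byte; the class below is illustrative and delegates to a plain `DataOutput`, it is not the actual `BufferedDataOutputStreamPlus` code.

```java
import java.io.DataOutput;
import java.io.IOException;

// Illustrative sketch only; structure and names are assumptions, not the
// actual Cassandra classes. Dispatch on the byte count and emit the most
// significant bytes of the register through primitive write calls, so a
// checksumming subclass that overrides writeByte/writeShort/writeInt/
// writeLong counts every byte exactly once.
final class MostSignificantBytesWriter
{
    private final DataOutput out;

    MostSignificantBytesWriter(DataOutput out)
    {
        this.out = out;
    }

    // Writes the `bytes` most significant bytes of `register`, high byte first.
    void writeMostSignificantBytes(long register, int bytes) throws IOException
    {
        switch (bytes)
        {
            case 0: break;
            case 1: out.writeByte((int) (register >>> 56)); break;
            case 2: out.writeShort((int) (register >>> 48)); break;
            case 3: out.writeShort((int) (register >>> 48));
                    out.writeByte((int) (register >>> 40)); break;
            case 4: out.writeInt((int) (register >>> 32)); break;
            case 5: out.writeInt((int) (register >>> 32));
                    out.writeByte((int) (register >>> 24)); break;
            case 6: out.writeInt((int) (register >>> 32));
                    out.writeShort((int) (register >>> 16)); break;
            case 7: out.writeInt((int) (register >>> 32));
                    out.writeShort((int) (register >>> 16));
                    out.writeByte((int) (register >>> 8)); break;
            case 8: out.writeLong(register); break;
            default: throw new IllegalArgumentException("bytes must be in [0, 8]: " + bytes);
        }
    }
}
```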