CNDB-14077: Reduce compaction thread pool size to match num of physical cores #1736

michaeljmarshall · 2025-05-15T19:05:23Z

What is the issue

Relates to https://github.com/riptano/cndb/issues/14077, but doesn't necessarily solve it.

What does this PR fix and why was it fixed

When parallelizing vector graph insertions, we want the thread count to match the number of physical cores, not the virtual (hyperthreaded) ones. This should improve efficiency and, in my limited testing, it reduces the number of concurrent updates to the ConcurrentNeighborMap during a build of the SIFT 1M dataset.

My data:
For normal graph construction before this change, I saw 19412 retries in the insertDiverse method.
For normal graph construction with this change, I saw 9371 retries in the insertDiverse method.
For normal graph + hierarchy before this change, I saw 22500 retries in the insertDiverse method.
For graph {'similarity_function' : 'euclidean', 'enable_hierarchy': 'true', 'construction_beam_width': '200', 'maximum_node_connections': '32'}, without this change I saw 58773 retries and with the change I saw 27775.
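To illustrate why fewer concurrent writers can mean fewer retries, here is a minimal sketch of a copy-on-write update guarded by compare-and-set, with a counter for failed attempts. It is an assumption-laden stand-in: `RetryCountingNeighbors` and its methods are hypothetical names, and this is not JVector's `ConcurrentNeighborMap` or its `insertDiverse` logic.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for a concurrently updated neighbor list (not JVector code). */
final class RetryCountingNeighbors {
    private final AtomicReference<int[]> neighbors = new AtomicReference<>(new int[0]);
    private final LongAdder retries = new LongAdder(); // thread-safe counter

    /** Appends a node id with a copy-on-write CAS loop, counting every failed attempt. */
    void insert(int nodeId) {
        while (true) {
            int[] current = neighbors.get();
            int[] updated = Arrays.copyOf(current, current.length + 1);
            updated[current.length] = nodeId;
            if (neighbors.compareAndSet(current, updated))
                return;
            retries.increment(); // another thread won the race; rebuild and try again
        }
    }

    int size() { return neighbors.get().length; }

    long retryCount() { return retries.sum(); }

    /** Runs `threads` writers that each insert `perThread` ids, then returns the structure. */
    static RetryCountingNeighbors run(int threads, int perThread) {
        RetryCountingNeighbors n = new RetryCountingNeighbors();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++)
                    n.insert(base + i);
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return n;
    }

    public static void main(String[] args) {
        for (int threads : new int[] { 1, 2, 4, 8 }) {
            RetryCountingNeighbors n = run(threads, 2_000);
            System.out.println(threads + " threads: " + n.retryCount() + " retries");
        }
    }
}
```

With one writer the CAS never fails, so the retry count is zero; with more writers the count depends on the machine and scheduling, so no specific numbers are claimed here.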

The graph construction times fluctuated with the change, but I'm not sure the timing matters much, since my Mac does not have SIMD:

$ sysctl -n machdep.cpu.features
FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C


github-actions · 2025-05-15T19:05:40Z

### What is the issue

Fixes: riptano/cndb#14160

### What does this PR fix and why was it fixed

The loop is supposed to loop until the deadline, not after the deadline. The test fails without the change.

(cherry picked from commit cd11ec8)

eolivelli · 2025-05-23T10:16:54Z

src/java/org/apache/cassandra/index/sai/disk/v1/SegmentBuilder.java

+    /** for parallelism within a single compaction
+     *  see comments to JVector PhysicalCoreExecutor -- HT tends to cause contention for the SIMD units
+     */
+    public static final ExecutorService compactionExecutor = new DebuggableThreadPoolExecutor(Runtime.getRuntime().availableProcessors() / 2,


What about making this configurable, in case we need to roll back?

michaeljmarshall · 2025-05-28

FWIW, I copied this logic from:

cassandra/src/java/org/apache/cassandra/index/sai/disk/vector/CompactionGraph.java

Lines 104 to 108 in 61f57a6

    // see comments to JVector PhysicalCoreExecutor -- HT tends to cause contention for the SIMD units
    private static final ForkJoinPool compactionSimdPool = new ForkJoinPool(Runtime.getRuntime().availableProcessors() / 2,
                                                                            new LowPriorityThreadFactory(),
                                                                            null,
                                                                            false);

Are we looking to make all of these configurable?
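As a rough sketch of what a configurable pool size could look like: the property name below comes from the later commit "Add config: cassandra.sai.compaction.executor.threads", but everything else is an assumption. `CompactionThreads`, `defaultThreads`, and `resolve` are hypothetical names, the fallback of half the logical processors assumes 2-way hyperthreading, and a plain `Executors.newFixedThreadPool` stands in for Cassandra's `DebuggableThreadPoolExecutor`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical helper; not the code merged in this PR. */
final class CompactionThreads {
    /** Property name as given in the commit message. */
    static final String PROPERTY = "cassandra.sai.compaction.executor.threads";

    /** Assumed default: half the logical processors (2-way SMT), never below one. */
    static int defaultThreads(int logicalProcessors) {
        return Math.max(1, logicalProcessors / 2);
    }

    /** Uses the system property when it is a positive integer, else the default. */
    static int resolve(int logicalProcessors) {
        Integer configured = Integer.getInteger(PROPERTY); // null if unset or unparsable
        return (configured != null && configured > 0)
               ? configured
               : defaultThreads(logicalProcessors);
    }

    public static void main(String[] args) {
        int threads = resolve(Runtime.getRuntime().availableProcessors());
        // Stand-in executor for illustration only.
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        System.out.println("compaction executor threads: " + threads);
        executor.shutdown();
    }
}
```

Running with `-Dcassandra.sai.compaction.executor.threads=16` would restore a larger pool without a code change, which is the rollback path asked about above. The `Math.max(1, ...)` guard covers single-processor hosts, where `availableProcessors() / 2` is 0 and `ForkJoinPool` rejects a non-positive parallelism.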

eolivelli

Patch LGTM, but I think the branch contains an additional commit from other patches.

Accidentally cherry picked to this branch. This reverts commit a15ceca.

michaeljmarshall · 2025-05-30T17:38:31Z

Companion CNDB test PR: https://github.com/riptano/cndb/pull/14297

eolivelli

LGTM

sonarqubecloud · 2025-06-13T22:05:18Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
96.9% Coverage on New Code
0.0% Duplication on New Code


cassci-bot · 2025-06-13T22:09:58Z

❌ Build ds-cassandra-pr-gate/PR-1736 rejected by Butler

1 new test failure(s) in 4 builds

Found 1 new test failures

Test	Explanation	Branch history	Upstream history
o.a.c.u.b.BinLogTest.testTruncationReleasesLogS...	regression	🔴🔴🔴🔵	🔵🔵🔵🔵🔵🔵🔵

Found 6 known test failures

tlwillke · 2025-07-10T14:35:46Z

Would like to see some perf data for ~10M scale datasets, with and without SIMD. When an architecture like AVX-512 is enabled, it makes sense to thread based on the number of physical (not virtual / hyperthreaded) CPUs. But what about when SIMD is not enabled or available? Then use hyperthreading? Perhaps the threadpool size should depend on SIMD enabling.
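The suggestion above could be sketched as follows. This is only an illustration of the idea, not a tested policy: `SimdAwarePoolSize` and the `simdEnabled` flag are hypothetical, the caller is assumed to know whether vectorized code paths are active, and halving the logical processor count assumes 2-way hyperthreading.

```java
/** Hypothetical sizing policy; whether it helps would need the perf data requested above. */
final class SimdAwarePoolSize {
    /**
     * With SIMD active, size the pool to the (assumed) physical core count;
     * without it, use every logical processor.
     */
    static int threads(int logicalProcessors, boolean simdEnabled) {
        if (!simdEnabled)
            return Math.max(1, logicalProcessors);
        return Math.max(1, logicalProcessors / 2);
    }

    public static void main(String[] args) {
        int logical = Runtime.getRuntime().availableProcessors();
        System.out.println("SIMD on:  " + threads(logical, true));
        System.out.println("SIMD off: " + threads(logical, false));
    }
}
```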

CNDB-14077: Reduce compaction thread pool size to match num of physic…

812e7ab

…al cores

michaeljmarshall requested a review from tlwillke May 15, 2025 19:05

michaeljmarshall self-assigned this May 15, 2025

eolivelli reviewed May 23, 2025


Revert "CNDB-14160: Fix IndexContext#getReferencedView (#1744)"

6c96b35

Accidentally cherry picked to this branch. This reverts commit a15ceca.

eolivelli approved these changes Jun 5, 2025


michaeljmarshall added 3 commits June 12, 2025 13:32

Add config: cassandra.sai.compaction.executor.threads

9eb46e7

Temp: add metrics (will develop better metrics later)

139a500

Fix threadsafety issue in metrics

543564d

michaeljmarshall May 28, 2025
