Skip to content

CNDB-14077: Reduce compaction thread pool size to match num of physical cores #1736

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

michaeljmarshall
Copy link
Member

What is the issue

Relates to https://github.com/riptano/cndb/issues/14077, but doesn't necessarily solve it.

What does this PR fix and why was it fixed

When we are parallelizing vector graph insertions, we want to set the number of threads to the physical cores, not the virtual ones. This should improve efficiency and in my limited testing reduces the number of concurrent updates to the ConcurrentNeighborMap during a build of the sift 1M dataset.

My data:
For normal graph construction before this change, I saw 19412 retries in the insertDiverse method.
For normal graph construction with this change, I saw 9371 retries in the insertDiverse method.
For normal graph + hierarchy before this change, I saw 22500 retries in the insertDiverse method.
For graph {'similarity_function' : 'euclidean', 'enable_hierarchy': 'true', 'construction_beam_width': '200', 'maximum_node_connections': '32'}, without this change I saw 58773 retries and with the change I saw 27775.

The graph construction times fluctuated with the change, but I'm not sure time matters significantly since my mac does not have simd:

$ sysctl -n machdep.cpu.features
FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C

@michaeljmarshall michaeljmarshall requested a review from tlwillke May 15, 2025 19:05
@michaeljmarshall michaeljmarshall self-assigned this May 15, 2025
Copy link

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

### What is the issue
Fixes: riptano/cndb#14160

### What does this PR fix and why was it fixed
The loop is supposed to loop until the deadline, not after the deadline.

The test fails without the change.

(cherry picked from commit cd11ec8)
/** for parallelism within a single compaction
* see comments to JVector PhysicalCoreExecutor -- HT tends to cause contention for the SIMD units
*/
public static final ExecutorService compactionExecutor = new DebuggableThreadPoolExecutor(Runtime.getRuntime().availableProcessors() / 2,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about making this configurable ? in case we need to rollback

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, I copied this logic from:

// see comments to JVector PhysicalCoreExecutor -- HT tends to cause contention for the SIMD units
private static final ForkJoinPool compactionSimdPool = new ForkJoinPool(Runtime.getRuntime().availableProcessors() / 2,
new LowPriorityThreadFactory(),
null,
false);

Are we looking to make all of these configurable?

Copy link

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

patch LGTM

but I think that the branch contains an additional commit from other patches

Accidentally cherry picked to this branch.

This reverts commit a15ceca.
@michaeljmarshall
Copy link
Member Author

Companion CNDB test PR: https://github.com/riptano/cndb/pull/14297

Copy link

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link

@cassci-bot
Copy link

❌ Build ds-cassandra-pr-gate/PR-1736 rejected by Butler


1 new test failure(s) in 4 builds
See build details here


Found 1 new test failures

Test Explanation Branch history Upstream history
o.a.c.u.b.BinLogTest.testTruncationReleasesLogS... regression 🔴🔴🔴🔵 🔵🔵🔵🔵🔵🔵🔵

Found 6 known test failures

@tlwillke
Copy link

Would like to see some perf data for ~10M scale datasets, with and without SIMD. When an architecture like AVX-512 is enabled, it makes sense to thread based on the number of physical (not virtual / hyperthreaded) CPUs. But what about when SIMD is not enabled or available? Then use hyperthreading? Perhaps the threadpool size should depend on SIMD enabling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants