Optimize GPU radix sort with bitmask ranking and multi-element processing #8529

Merged: mvaligursky merged 1 commit into main from mv-radix-sort-optimize on Mar 13, 2026
@mvaligursky mvaligursky commented Mar 13, 2026

Optimizes the GPU compute radix sort with several algorithmic improvements, cutting sort time by ~70% on a 10M-element benchmark (9.2ms → ~2.8ms).

Changes:

  • Eliminate the local_prefix_sums GPU buffer entirely by computing local ranks in shared memory using per-digit 256-bit bitmasks and hardware countOneBits (popcount) — zero warp divergence
  • Process 8 elements per thread (2048 elements per workgroup), reducing workgroup count 8× and shrinking the prefix sum hierarchy
  • Simplify the histogram pass to pure atomic counting, removing the shared-memory ranking that was previously interleaved with it
  • Skip key buffer write on the last sorting pass (only values/indices matter for final output)
  • Pre-fetch values before the workgroup barrier to overlap memory fetch with ranking computation
  • Fix indirect dispatch mismatch by allocating separate dispatch slots for key generation (256 threads/workgroup) and sorting (2048 elements/workgroup)
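
To illustrate the bitmask ranking idea from the first bullet, here is a minimal CPU sketch in Python rather than the actual WGSL shader. It assumes one bit per thread in each 256-bit mask (one mask per digit value) and a 4-bit digit; the PR does not state the digit width, so `RADIX_BITS` is an assumption, and the function names are hypothetical. A Python integer stands in for the mask, which a GPU would more likely hold as several 32-bit words in shared memory.

```python
WORKGROUP_SIZE = 256          # threads per workgroup, from the PR description
RADIX_BITS = 4                # ASSUMPTION: digit width is not stated in the PR
NUM_DIGITS = 1 << RADIX_BITS


def popcount(x):
    """Stand-in for the hardware countOneBits instruction."""
    return bin(x).count("1")


def local_ranks(keys, shift):
    """Return (ranks, counts) for one tile of up to 256 keys.

    ranks[t] is the number of earlier threads holding the same digit as
    thread t; counts[d] is how many keys in the tile have digit d.
    """
    assert len(keys) <= WORKGROUP_SIZE
    digits = [(k >> shift) & (NUM_DIGITS - 1) for k in keys]

    # One bitmask per digit value; bit t is set when thread t holds that digit.
    # On a GPU this step would be atomic ORs into shared memory.
    masks = [0] * NUM_DIGITS
    for t, d in enumerate(digits):
        masks[d] |= 1 << t

    # (a workgroup barrier would sit here on the GPU)

    # Every thread runs the same mask-and-popcount sequence, with no
    # data-dependent branching.
    ranks = [popcount(masks[d] & ((1 << t) - 1)) for t, d in enumerate(digits)]
    counts = [popcount(m) for m in masks]
    return ranks, counts


def sort_tile_by_digit(keys, shift):
    """Stable scatter of one tile by the digit at `shift`, using the ranks."""
    ranks, counts = local_ranks(keys, shift)
    offsets, total = [], 0
    for c in counts:              # exclusive prefix sum over digit counts
        offsets.append(total)
        total += c
    out = [None] * len(keys)
    for k, r in zip(keys, ranks):
        d = (k >> shift) & (NUM_DIGITS - 1)
        out[offsets[d] + r] = k
    return out


print(sort_tile_by_digit([3, 1, 3, 0, 1, 3], 0))  # [0, 1, 1, 3, 3, 3]
```

Because each rank is derived from the masks alone, no per-element prefix sums need to be stored, which is consistent with dropping the local_prefix_sums buffer. How the shader combines the 8 elements each thread processes is not described here, so the sketch covers a single 256-element tile only.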

API Changes:

  • New export RADIX_SORT_ELEMENTS_PER_WORKGROUP from ComputeRadixSort module (used by compaction classes to compute correct dispatch sizes)
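
As a rough illustration of why the two passes need separate dispatch sizes, the sketch below computes workgroup counts for both. The values 256 and 2048 come from the description above; the helper names `workgroup_count` and `dispatch_slots` are hypothetical, and the Python constant only mirrors the exported name for readability.

```python
THREADS_PER_WORKGROUP = 256
ELEMENTS_PER_THREAD = 8
# Mirrors the exported constant; 2048 is the figure given in the PR description.
RADIX_SORT_ELEMENTS_PER_WORKGROUP = THREADS_PER_WORKGROUP * ELEMENTS_PER_THREAD


def workgroup_count(num_elements, elements_per_workgroup):
    """Ceiling division: enough workgroups to cover every element."""
    return (num_elements + elements_per_workgroup - 1) // elements_per_workgroup


def dispatch_slots(num_elements):
    """Hypothetical helper: one dispatch size per pass type."""
    return {
        # key generation handles one element per thread
        "key_generation": workgroup_count(num_elements, THREADS_PER_WORKGROUP),
        # sorting handles 8 elements per thread
        "sort": workgroup_count(num_elements, RADIX_SORT_ELEMENTS_PER_WORKGROUP),
    }


print(dispatch_slots(10_000_000))  # {'key_generation': 39063, 'sort': 4883}
```

For the 10M-element case the two counts differ by roughly 8×, so sharing a single indirect dispatch slot between the passes would size at least one of them incorrectly.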

Performance:

  • 10M-element 20-bit key sort: ~9.2ms → ~2.8ms (~70% less time, roughly 3.3× faster)
  • Reduced GPU memory usage: one fewer storage buffer (local_prefix_sums) per sort instance
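
For reference, the arithmetic behind the timing figures can be checked as below. The inputs are the approximate timings quoted above; the derived throughput is only as precise as those measurements.

```python
before_ms = 9.2
after_ms = 2.8
num_elements = 10_000_000

time_reduction = (before_ms - after_ms) / before_ms        # fraction of time saved
speedup = before_ms / after_ms                              # how many times faster
throughput = num_elements / (after_ms / 1000.0) / 1e9       # billions of keys per second

print(f"time reduction: {time_reduction:.1%}")   # time reduction: 69.6%
print(f"speedup: {speedup:.2f}x")                # speedup: 3.29x
print(f"throughput: {throughput:.2f} G keys/s")  # throughput: 3.57 G keys/s
```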

…processing

Eliminate the local_prefix_sums GPU buffer by computing local ranks in
shared memory using per-digit 256-bit bitmasks and hardware popcount.
Process 8 elements per thread (2048 per workgroup) to reduce dispatch
overhead and shrink the prefix sum tree. Skip key write on the last
pass and pre-fetch values before barriers.

Fix indirect dispatch mismatch by allocating separate dispatch slots
for key generation (256 threads/workgroup) and sorting (2048
elements/workgroup).

Made-with: Cursor
@mvaligursky mvaligursky self-assigned this Mar 13, 2026
@mvaligursky mvaligursky merged commit f19d6ae into main Mar 13, 2026
8 checks passed
@mvaligursky mvaligursky deleted the mv-radix-sort-optimize branch March 13, 2026 15:40