Optimize GPU radix sort with bitmask ranking and multi-element processing by mvaligursky · Pull Request #8529 · playcanvas/engine

mvaligursky · 2026-03-13T15:38:09Z

Optimizes the GPU compute radix sort with several algorithmic improvements, cutting sort time by ~70% on a 10M-element benchmark (~9.2ms → ~2.8ms, roughly 3.3× faster).

Changes:

Eliminate the local_prefix_sums GPU buffer entirely by computing local ranks in shared memory using per-digit 256-bit bitmasks and hardware countOneBits (popcount), with zero warp divergence
Process 8 elements per thread (2048 elements per workgroup), reducing workgroup count 8× and shrinking the prefix sum hierarchy
Simplify histogram pass to pure atomic counting — removes shared memory ranking that was previously interleaved
Skip key buffer write on the last sorting pass (only values/indices matter for final output)
Pre-fetch values before the workgroup barrier to overlap memory fetch with ranking computation
Fix indirect dispatch mismatch by allocating separate dispatch slots for key generation (256 threads/workgroup) and sorting (2048 elements/workgroup)
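
To illustrate the bitmask ranking idea from the first item, here is a minimal CPU-side sketch in Python. It is not the shader code from this PR. It only simulates one workgroup tile under stated assumptions: 256 threads and 8 elements per thread are taken from the description, while the 4-bit digit width, the round-by-round layout, and the names `local_ranks` and `popcount` are hypothetical. Each "thread" sets its bit in the mask of its digit, and its local rank is the popcount of the lower bits of that mask plus the number of matching keys seen in earlier rounds.

```python
THREADS = 256            # threads per workgroup (from the PR description)
ELEMENTS_PER_THREAD = 8  # elements per thread (from the PR description)
RADIX_BITS = 4           # assumption: the digit width is not stated in the PR


def popcount(x: int) -> int:
    """CPU stand-in for the GPU's countOneBits."""
    return bin(x).count("1")


def local_ranks(keys, shift):
    """For each key in one tile, return its rank among earlier keys with the same digit."""
    num_digits = 1 << RADIX_BITS
    digit_base = [0] * num_digits  # matching keys counted in earlier rounds
    ranks = [0] * len(keys)
    for start in range(0, len(keys), THREADS):
        chunk = keys[start:start + THREADS]
        masks = [0] * num_digits  # one 256-bit mask per digit (a Python int here)
        # Step 1: every thread sets its own bit in the mask of its digit.
        for t, key in enumerate(chunk):
            d = (key >> shift) & (num_digits - 1)
            masks[d] |= 1 << t
        # (On the GPU a workgroup barrier would sit here.)
        # Step 2: rank = set bits below this thread's bit, plus earlier rounds.
        for t, key in enumerate(chunk):
            d = (key >> shift) & (num_digits - 1)
            below = masks[d] & ((1 << t) - 1)
            ranks[start + t] = digit_base[d] + popcount(below)
        # Step 3: carry this round's per-digit totals into the next round.
        for d in range(num_digits):
            digit_base[d] += popcount(masks[d])
    return ranks


print(local_ranks([3, 1, 3, 0, 1, 3], shift=0))  # [0, 0, 1, 0, 1, 2]
```

Because every thread runs the same mask-and-popcount sequence regardless of its digit, there is no data-dependent branching in the ranking step, which is the property the description refers to as zero warp divergence.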

API Changes:

New export RADIX_SORT_ELEMENTS_PER_WORKGROUP from ComputeRadixSort module (used by compaction classes to compute correct dispatch sizes)
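
As a rough illustration of why key generation and sorting need separate dispatch sizes, the sketch below computes both workgroup counts with a ceiling division. It is written in Python for brevity rather than in the engine's own language. The value 2048 for `RADIX_SORT_ELEMENTS_PER_WORKGROUP` is inferred from the description (8 elements × 256 threads), and `KEYGEN_THREADS_PER_WORKGROUP`, `workgroup_count`, and `dispatch_sizes` are hypothetical names, not part of the engine API.

```python
RADIX_SORT_ELEMENTS_PER_WORKGROUP = 2048  # inferred from the PR text: 8 elements x 256 threads
KEYGEN_THREADS_PER_WORKGROUP = 256        # hypothetical name; one element per thread


def workgroup_count(num_elements: int, elements_per_workgroup: int) -> int:
    """Ceiling division: enough workgroups to cover every element."""
    return (num_elements + elements_per_workgroup - 1) // elements_per_workgroup


def dispatch_sizes(num_elements: int) -> dict:
    """Workgroup counts for the two passes, which differ by roughly 8x."""
    return {
        "keygen": workgroup_count(num_elements, KEYGEN_THREADS_PER_WORKGROUP),
        "sort": workgroup_count(num_elements, RADIX_SORT_ELEMENTS_PER_WORKGROUP),
    }


print(dispatch_sizes(10_000_000))  # {'keygen': 39063, 'sort': 4883}
```

Reusing the key-generation count for the sort pass would launch about eight times too many sort workgroups, and the reverse would leave most elements without a key, which is the kind of mismatch the separate dispatch slots are meant to avoid.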

Performance:

10M-element 20-bit key sort: ~9.2ms → ~2.8ms (~70% reduction in sort time)
Reduced GPU memory usage: one fewer storage buffer (local_prefix_sums) per sort instance
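
For reference, the quoted timings can be turned into a time reduction, a speedup factor, and an approximate throughput. The short Python sketch below uses only the numbers reported above; the variable names are illustrative, and the throughput figure assumes the 2.8ms covers the full 10M-element sort.

```python
before_ms = 9.2        # reported time before the change
after_ms = 2.8         # reported time after the change
elements = 10_000_000  # benchmark size

time_reduction = (before_ms - after_ms) / before_ms  # fraction of time saved
speedup = before_ms / after_ms                       # how many times faster
throughput = elements / (after_ms / 1000.0)          # elements sorted per second

print(f"time reduction: {time_reduction:.1%}")         # time reduction: 69.6%
print(f"speedup: {speedup:.2f}x")                      # speedup: 3.29x
print(f"throughput: {throughput / 1e9:.2f} G elem/s")  # throughput: 3.57 G elem/s
```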

…processing

Eliminate the local_prefix_sums GPU buffer by computing local ranks in shared memory using per-digit 256-bit bitmasks and hardware popcount. Process 8 elements per thread (2048 per workgroup) to reduce dispatch overhead and shrink the prefix sum tree. Skip key write on the last pass and pre-fetch values before barriers. Fix indirect dispatch mismatch by allocating separate dispatch slots for key generation (256 threads/workgroup) and sorting (2048 elements/workgroup).

Made-with: Cursor

mvaligursky self-assigned this Mar 13, 2026

vercel bot deployed to Preview – engine-api-docs March 13, 2026 15:38 View deployment

vercel bot deployed to Preview – engine March 13, 2026 15:39 View deployment

mvaligursky merged commit f19d6ae into main Mar 13, 2026
8 checks passed

mvaligursky deleted the mv-radix-sort-optimize branch March 13, 2026 15:40

