Optimize Gaussian tile intersection for mGPU by matthewdcong · Pull Request #446 · openvdb/fvdb-core

matthewdcong · 2026-02-06T18:38:44Z

Using tilesPerGaussianCumsum, we can prefetch exactly the range of intersection keys and values needed by the subsequent computeGaussianTileIntersections kernel. This significantly improves both the execution time of the kernel and its variance, going from 15-30 ms to a consistent 3 ms. The result is an end-to-end performance increase of about 7-8%.
The overhead of the prefetch when merging keys in the multi-GPU radix sort was larger than the penalty incurred for (rare) page faults. Removing the prefetch marginally increases performance.
A few small const fixes.
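
For reviewers, a minimal host-side sketch of how a prefix sum can give the exact range to prefetch. This is not the code in this PR: the function name, the struct, the per-device Gaussian range [gaussianBegin, gaussianEnd), and the assumption that tilesPerGaussianCumsum is an inclusive prefix sum are all hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type: a byte range inside the intersection key
// (or value) buffer.
struct PrefetchRange {
    std::size_t byteOffset; // start of the range, in bytes
    std::size_t byteCount;  // length of the range, in bytes
};

// Assumption: `cumsum` is an INCLUSIVE prefix sum of tiles per Gaussian, so
// Gaussian g writes its intersections to [cumsum[g - 1], cumsum[g]), with an
// implicit cumsum[-1] == 0. A device that processes Gaussians
// [gaussianBegin, gaussianEnd) therefore touches exactly the elements
// [cumsum[gaussianBegin - 1], cumsum[gaussianEnd - 1]).
inline PrefetchRange
intersectionPrefetchRange(const std::vector<std::int32_t> &cumsum,
                          std::size_t gaussianBegin,
                          std::size_t gaussianEnd,
                          std::size_t elementSize) {
    assert(gaussianBegin <= gaussianEnd && gaussianEnd <= cumsum.size());
    const std::size_t first =
        gaussianBegin == 0 ? 0 : static_cast<std::size_t>(cumsum[gaussianBegin - 1]);
    const std::size_t last =
        gaussianEnd == 0 ? 0 : static_cast<std::size_t>(cumsum[gaussianEnd - 1]);
    return {first * elementSize, (last - first) * elementSize};
}
```

The resulting offset and byte count are the kind of arguments one would hand to a managed-memory prefetch such as cudaMemPrefetchAsync before launching the kernel, so that only the pages the device will actually touch are migrated.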

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Optimize Gaussian tile intersection for mGPU

b318f0b

Signed-off-by: Matthew Cong <mcong@nvidia.com>

matthewdcong requested a review from a team as a code owner February 6, 2026 18:38

matthewdcong requested review from phapalova and sifakis February 6, 2026 18:38
