Optimize Gaussian tile intersection for mGPU #446

Open
matthewdcong wants to merge 1 commit into openvdb:main from matthewdcong:mgpu_isect_prefetch
matthewdcong commented Feb 6, 2026

  1. Using tilesPerGaussianCumsum, we can prefetch exactly the range of intersection keys and values needed by the subsequent computeGaussianTileIntersections kernel. This significantly improves both the kernel's execution time and its variance: it goes from 15-30 ms to a consistent 3 ms, for an end-to-end performance increase of about 7-8%.
  2. The overhead of the prefetch when merging keys in the multi-GPU radix sort was larger than the penalty incurred by the (rare) page faults it avoided. Removing that prefetch marginally improves performance.
  3. Some small const fixes.
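
For reviewers, here is a minimal host-side sketch of how the range in item 1 can be derived. It is illustrative only and not taken from the diff. It assumes tilesPerGaussianCumsum is an inclusive prefix sum of per-Gaussian tile counts and that each device processes a contiguous slice of Gaussians `[begin, end)`. The names `PrefetchRange` and `intersectionRangeForGaussians` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type (not from the PR): a contiguous span of
// intersection indices that one device will touch.
struct PrefetchRange {
    std::size_t offset; // index of the first intersection key/value
    std::size_t count;  // number of intersection keys/values
};

// Assumption: cumsum[i] is the total number of tile intersections produced
// by Gaussians 0..i (inclusive prefix sum). The Gaussians in [begin, end)
// then write to intersection indices [cumsum[begin - 1], cumsum[end - 1]).
inline PrefetchRange
intersectionRangeForGaussians(const std::vector<std::int32_t> &cumsum,
                              std::size_t begin,
                              std::size_t end) {
    // Empty or out-of-bounds slices need no prefetch.
    if (begin >= end || end > cumsum.size()) {
        return {0, 0};
    }
    const std::size_t first =
        (begin == 0) ? 0 : static_cast<std::size_t>(cumsum[begin - 1]);
    const std::size_t last = static_cast<std::size_t>(cumsum[end - 1]);
    assert(last >= first); // a prefix sum of counts is non-decreasing
    return {first, last - first};
}
```

With a range like this, the byte span to prefetch for a key array would be `keys + offset` for `count * sizeof(key)` bytes, presumably passed to a unified-memory prefetch call before launching the kernel. The exact call used in the PR is not shown here, so treat that part as an assumption.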

Signed-off-by: Matthew Cong <mcong@nvidia.com>
matthewdcong requested a review from a team as a code owner on February 6, 2026 18:38