
Shared memory optimizations for Gaussian rasterization #554

Open
matthewdcong wants to merge 2 commits into openvdb:main from matthewdcong:smem_features_forward_pass

Conversation

@matthewdcong
Contributor

  1. The forward rasterization pass does not currently store Gaussian features in shared memory. As the problem size grows (more tile intersections per Gaussian), the reuse gained by caching features in shared memory outweighs the cost of the unconditional global loads.
  2. In addition, we cull loads for Gaussians whose opacity is below the threshold required for a Gaussian to contribute in the volume rendering pass. This optimization applies to both the forward and backward passes.
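The two optimizations above can be sketched as the following forward-pass tile loop. This is a hypothetical illustration, assuming a per-tile batched rasterizer; the kernel signature, parameter names, and data layout are assumptions, not the actual fvdb code:

```cuda
// Sketch only: each thread block rasterizes one image tile and iterates
// over the tile's Gaussians in batches of BLOCK_SIZE. Threads cooperatively
// stage each batch, including its feature channels, into shared memory.
template <typename ScalarType, uint32_t NUM_CHANNELS, uint32_t BLOCK_SIZE>
__global__ void rasterizeForward(const ScalarType* __restrict__ opacities,
                                 const ScalarType* __restrict__ features,
                                 const int32_t* __restrict__ tileGaussianIds,
                                 uint32_t numGaussiansInTile,
                                 ScalarType minOpacity) {
    __shared__ ScalarType smemOpacity[BLOCK_SIZE];
    // Note: for large NUM_CHANNELS this allocation can exceed the
    // shared-memory limit, which is what the review discussion is about.
    __shared__ ScalarType smemFeatures[BLOCK_SIZE][NUM_CHANNELS];

    for (uint32_t start = 0; start < numGaussiansInTile; start += BLOCK_SIZE) {
        const uint32_t idx = start + threadIdx.x;
        if (idx < numGaussiansInTile) {
            const int32_t g = tileGaussianIds[idx];
            const ScalarType alpha = opacities[g];
            smemOpacity[threadIdx.x] = alpha;
            // Optimization 2: skip the feature loads entirely when the
            // Gaussian is too transparent to contribute to the render.
            if (alpha >= minOpacity) {
                for (uint32_t c = 0; c < NUM_CHANNELS; ++c) {
                    // Optimization 1: one global load per feature scalar;
                    // every thread in the block then reuses it from
                    // shared memory when blending its own pixel.
                    smemFeatures[threadIdx.x][c] =
                        features[g * NUM_CHANNELS + c];
                }
            }
        }
        __syncthreads();
        // ... alpha-blend the cached batch into this thread's pixel ...
        __syncthreads();
    }
}
```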

In profiling, this reduces a 17m 20s single-GPU reconstruction to 16m 48s, an approximately 3% speedup.

Signed-off-by: Matthew Cong <mcong@nvidia.com>
@matthewdcong matthewdcong requested a review from a team as a code owner March 18, 2026 06:08
Contributor

@harrism left a comment


One concern.


  // Thread blocks cooperatively cache a tile of Gaussians in shared memory
- const uint32_t sharedMem = getSharedMemRequirements<ScalarType>(tileSize);
+ const uint32_t sharedMem = getSharedMemRequirements<ScalarType>(NUM_CHANNELS, tileSize);
Contributor


🚩 issue: Shouldn't this be NUM_SHARED_CHANNELS? Also, what happens if the number of channels is too large to fit all the features in shared memory?

Contributor Author


There's no NUM_SHARED_CHANNELS in the forward pass (it's just NUM_CHANNELS) because chunking isn't implemented there. So, assuming unlimited shared memory, this is correct as written.

That being said, the lack of chunking is likely why the tests are failing for large feature depths, so I'll have to add that.
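A minimal host-side sketch of what such channel chunking could compute. The per-Gaussian layout, the `getSharedMemRequirements` signature, and the 48 KB per-block limit are all assumptions for illustration, not the actual fvdb implementation:

```cpp
#include <cstdint>

// Assumed default static shared-memory limit per block on many GPUs.
constexpr uint32_t kSharedMemLimit = 48 * 1024;

// Bytes of shared memory needed to cache `tileSize` Gaussians, each with a
// hypothetical layout of a 2D mean (2 scalars), a conic (3 scalars), an
// opacity (1 scalar), and `numChannels` feature scalars.
template <typename ScalarType>
uint32_t getSharedMemRequirements(uint32_t numChannels, uint32_t tileSize) {
    const uint32_t scalarsPerGaussian = 2 + 3 + 1 + numChannels;
    return tileSize * scalarsPerGaussian * sizeof(ScalarType);
}

// Smallest number of channel chunks such that caching each chunk's worth of
// features fits within the shared-memory limit.
template <typename ScalarType>
uint32_t numChannelChunks(uint32_t numChannels, uint32_t tileSize) {
    uint32_t chunks = 1;
    while (getSharedMemRequirements<ScalarType>(
               (numChannels + chunks - 1) / chunks, tileSize) > kSharedMemLimit) {
        ++chunks;
    }
    return chunks;
}
```

For a 16x16 tile (256 Gaussians per batch) with 3 float channels this fits easily in one chunk, while a 64-channel feature set would need to be split across multiple passes over the channel dimension.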

