Conversation

@DiamonDinoia
Collaborator

@paquiteau and @Lenoush have noticed that alloca made things slower in their benchmarks while greatly reducing memory consumption.

Details are in #570 and mind-inria/mri-nufft-benchmark#5

Instead of using opts.gpu_* to switch between the old and new implementations, it is better to use kernel dispatch and have pre-compiled kernels for the various scenarios, as is already done in the CPU code. That is one less parameter for the user to worry about, and it achieves both higher performance and low memory consumption at the same time.

@paquiteau, @Lenoush, can you benchmark this branch and let us know how it fares? I could not measure a meaningful difference with my custom code.

@DiamonDinoia requested a review from blackwer January 30, 2025 20:53
@DiamonDinoia changed the title from "Removing alloca to GPU" to "Removing alloca on GPU" Jan 30, 2025
@DiamonDinoia requested a review from janden January 30, 2025 21:02
@paquiteau

paquiteau commented Jan 31, 2025

Hello @DiamonDinoia! Interesting stuff :)
I will have a look in the coming days with @chaithyagr as well.

PS: @Lenoush's contract ended, so she does not work on NUFFTs anymore.

Member

@blackwer left a comment


Looks good to me. A few stale comments should be updated and then I'd be happy to merge. Marking as approved since they're not really critical

Member


Any idea why the constants are different around the last two digits?

Collaborator Author


They are computed up to eps. The other digits are noise.

Member


Just wondering if you had insight into why it wasn't deterministic. Maybe a different thread level for BLAS? MATLAB internals changing vectorization levels? It's just weird -- not really concerning

Methods available:
(1) Non-uniform points driven
(2) Subproblem
Member


Maybe get rid of the unsupported "subproblem" method comment while we're touching this code. We're already refactoring some of this stuff, so it seems like a good time to bring the comments up to date.

#else
T ker1[MAX_NSPREAD];
#endif
T ker1[ns];
Member


Remove stale comment

#else
T ker1[MAX_NSPREAD];
#endif
T ker1[ns];
Member


Remove stale comment

d_plan->opts.gpu_binsizey, d_plan->opts.gpu_binsizez);

if (d_plan->opts.gpu_kerevalmeth) {
if (const auto finufft_err =
Member


Nice catch on factoring this out

@chaithyagr
Contributor

Hey, I will perhaps get some time on this tomorrow. Can you give some context on what exactly the change here is, and against what version you would prefer us to benchmark it? So sorry for the delay.

@DiamonDinoia
Collaborator Author

Hey, I will perhaps get some time on this tomorrow. Can you give some context on what exactly the change here is, and against what version you would prefer us to benchmark it? So sorry for the delay.

Referring to this discussion:
GitHub Issue Comment

When building cuFINUFFT natively on your machine, you observed a performance regression but lower memory utilization. This was caused by the use of dynamic stack allocation to allocate memory for the kernel tensors. Previously, the approach was to allocate MAX_ARRAY on the stack. However, since the GPU stack is relatively small, this would spill over into global memory, leading to higher memory utilization.

In my benchmarks, I did not observe a performance regression when using alloca (dynamic stack allocation), which differs from your experience.

To achieve the best of both worlds, I am now avoiding alloca and instead using a template recursion trick to generate different CUDA kernels based on varying spreading width values. This approach eliminates the need for alloca while ensuring that no more memory is used than necessary.

Before integrating this change, I’d like to see how it impacts your benchmarks. Let me know your thoughts!

Cheers,
Marco
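
For context, a minimal sketch of the template-recursion dispatch idea described above (hypothetical names, generic CUDA C++, not the actual cuFINUFFT code): one kernel is pre-compiled per spreading width, and the launcher recursively walks the widths until it hits the runtime value, so the weight array always has a compile-time size and neither alloca nor an oversized MAX_NSPREAD buffer is needed.

#include <utility>

constexpr int MIN_NS = 2, MAX_NS = 16;

// One pre-compiled kernel per spreading width NS.
template <int NS>
__global__ void spread_kernel(const float *x, float *out, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float ker1[NS];  // compile-time size: no alloca, no MAX_NSPREAD over-allocation
  for (int j = 0; j < NS; ++j) ker1[j] = x[i];  // placeholder for the real weight evaluation
  float acc = 0.0f;
  for (int j = 0; j < NS; ++j) acc += ker1[j];
  out[i] = acc;
}

// Template recursion: instantiates spread_kernel<MIN_NS>..spread_kernel<MAX_NS>
// at compile time and launches the instantiation matching the runtime width ns.
template <int NS = MIN_NS, typename... Args>
void launch_spread(int ns, dim3 grid, dim3 block, Args &&...args) {
  if (ns == NS) {
    spread_kernel<NS><<<grid, block>>>(std::forward<Args>(args)...);
  } else if constexpr (NS < MAX_NS) {
    launch_spread<NS + 1>(ns, grid, block, std::forward<Args>(args)...);
  }
}

A call like launch_spread(ns, grid, block, d_x, d_out, n) then picks the matching specialization without any per-call dynamic stack allocation.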

Collaborator

@janden left a comment


Looks good as far as I'm concerned, although I confess some of this std::forward is a bit over my head.

@DiamonDinoia
Collaborator Author

Looks good as far as I'm concerned, although I confess some of this std::forward is a bit over my head.

Just think of it as an rvalue cast, T -> T&&, to avoid copies :)
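
For reference, a tiny generic C++ sketch of that point (not code from this PR): inside a forwarding wrapper the named parameter is an lvalue, and std::forward<T>(arg) casts it back to an rvalue whenever the caller passed one, so the callee can move instead of copy.

#include <string>
#include <utility>

// sink takes its argument by value; a moved-in rvalue avoids a deep copy.
void sink(std::string s) { (void)s; }

// Perfect forwarding: forwards lvalues as lvalues and rvalues as rvalues (T -> T&&).
template <typename T>
void wrapper(T &&arg) { sink(std::forward<T>(arg)); }

int main() {
  std::string name = "finufft";
  wrapper(name);             // lvalue: sink receives a copy, name stays valid
  wrapper(std::move(name));  // rvalue: sink's argument is move-constructed, no copy
}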

@DiamonDinoia
Collaborator Author

Merging for now. If @chaithyagr finds issues, I'll open a separate issue or revert.

@DiamonDinoia merged commit 73412b4 into flatironinstitute:master Feb 11, 2025
146 checks passed
@DiamonDinoia deleted the removing-alloca branch April 22, 2025 00:37