Conversation

@DiamonDinoia
Collaborator

@paquiteau and @Lenoush have noticed that alloca made things slower in their benchmarks while greatly reducing memory consumption.

Details are in #570 and mind-inria/mri-nufft-benchmark#5

Instead of using opts.gpu_* to switch between the old and new implementations, it is better to use kernel dispatch and have pre-compiled kernels for the various scenarios, as is already done in the CPU code. That is one less parameter for the user to worry about, and it achieves both higher performance and low memory consumption at the same time.

@paquiteau, @Lenoush, can you benchmark this branch and let us know how it fares? I could not measure a meaningful difference with my custom code.

@DiamonDinoia requested a review from blackwer January 30, 2025 20:53
@DiamonDinoia changed the title from "Removing alloca to GPU" to "Removing alloca on GPU" Jan 30, 2025
@DiamonDinoia requested a review from janden January 30, 2025 21:02
@paquiteau

paquiteau commented Jan 31, 2025

Hello @DiamonDinoia! Interesting stuff :)
I will have a look in the coming days with @chaithyagr as well.

PS: @Lenoush's contract ended, so she does not work on NUFFTs anymore.

Member

@blackwer left a comment


Looks good to me. A few stale comments should be updated and then I'd be happy to merge. Marking as approved since they're not really critical

Member


Any idea why the constants are different around the last two digits?

Collaborator Author


They are computed up to eps. The other digits are noise.

Member


Just wondering if you had insight into why it wasn't deterministic. Maybe a different thread level for BLAS? MATLAB internals changing vectorization levels? It's just weird -- not really concerning

Methods available:
(1) Non-uniform points driven
(2) Subproblem
Member


Maybe get rid of the unsupported "subproblem" method comment while we're touching this code. We're already refactoring some of this stuff, so it seems like a good time to bring the comments up to date.

#else
T ker1[MAX_NSPREAD];
#endif
T ker1[ns];
Member


Remove stale comment

#else
T ker1[MAX_NSPREAD];
#endif
T ker1[ns];
Member


Remove stale comment

d_plan->opts.gpu_binsizey, d_plan->opts.gpu_binsizez);

if (d_plan->opts.gpu_kerevalmeth) {
if (const auto finufft_err =
Member


Nice catch on factoring this out

@chaithyagr
Contributor

Hey, I will perhaps get some time on this tomorrow. Can you give some context on what exactly the change here is, and against what version you would prefer us to benchmark it? So sorry for the delay.

@DiamonDinoia
Collaborator Author

Hey, I will perhaps get some time on this tomorrow. Can you give some context on what exactly the change here is, and against what version you would prefer us to benchmark it? So sorry for the delay.

Referring to this discussion:
GitHub Issue Comment

When building cuFINUFFT natively on your machine, you observed a performance regression but lower memory utilization. This was caused by the use of dynamic stack allocation to allocate memory for the kernel tensors. Previously, the approach was to allocate MAX_ARRAY on the stack. However, since the GPU stack is relatively small, this would spill over into global memory, leading to higher memory utilization.

In my benchmarks, I did not observe a performance regression when using alloca (dynamic stack allocation), which differs from your experience.

To achieve the best of both worlds, I am now avoiding alloca and instead using a template recursion trick to generate different CUDA kernels based on varying spreading width values. This approach eliminates the need for alloca while ensuring that no more memory is used than necessary.

Before integrating this change, I’d like to see how it impacts your benchmarks. Let me know your thoughts!

Cheers,
Marco
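
For context, a minimal sketch of the template-recursion dispatch idea described above (hypothetical names, generic CUDA C++, not the actual cuFINUFFT code): one kernel is pre-compiled per spreading width, and the launcher recursively walks the widths until it hits the runtime value, so the weight array always has a compile-time size and neither alloca nor an oversized MAX_NSPREAD buffer is needed.

#include <utility>

constexpr int MIN_NS = 2, MAX_NS = 16;

// One pre-compiled kernel per spreading width NS.
template <int NS>
__global__ void spread_kernel(const float *x, float *out, int n) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float ker1[NS];  // compile-time size: no alloca, no MAX_NSPREAD over-allocation
  for (int j = 0; j < NS; ++j) ker1[j] = x[i];  // placeholder for the real weight evaluation
  float acc = 0.0f;
  for (int j = 0; j < NS; ++j) acc += ker1[j];
  out[i] = acc;
}

// Template recursion: instantiates spread_kernel<MIN_NS>..spread_kernel<MAX_NS>
// at compile time and launches the instantiation matching the runtime width ns.
template <int NS = MIN_NS, typename... Args>
void launch_spread(int ns, dim3 grid, dim3 block, Args &&...args) {
  if (ns == NS) {
    spread_kernel<NS><<<grid, block>>>(std::forward<Args>(args)...);
  } else if constexpr (NS < MAX_NS) {
    launch_spread<NS + 1>(ns, grid, block, std::forward<Args>(args)...);
  }
}

A call like launch_spread(ns, grid, block, d_x, d_out, n) then picks the matching specialization without any per-call dynamic stack allocation.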

Collaborator

@janden left a comment


Looks good as far as I'm concerned, although I confess some of this std::forward is a bit over my head.

@DiamonDinoia
Collaborator Author

Looks good as far as I'm concerned, although I confess some of this std::forward is a bit over my head.

Just think of it as an rvalue cast, T -> T&&, to avoid copies :)
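
For reference, a tiny generic C++ sketch of that point (not code from this PR): inside a forwarding wrapper the named parameter is an lvalue, and std::forward<T>(arg) casts it back to an rvalue whenever the caller passed one, so the callee can move instead of copy.

#include <string>
#include <utility>

// sink takes its argument by value; a moved-in rvalue avoids a deep copy.
void sink(std::string s) { (void)s; }

// Perfect forwarding: forwards lvalues as lvalues and rvalues as rvalues (T -> T&&).
template <typename T>
void wrapper(T &&arg) { sink(std::forward<T>(arg)); }

int main() {
  std::string name = "finufft";
  wrapper(name);             // lvalue: sink receives a copy, name stays valid
  wrapper(std::move(name));  // rvalue: sink's argument is move-constructed, no copy
}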

@DiamonDinoia
Collaborator Author

Merging for now. If @chaithyagr finds issues, I'll open a separate issue or revert.

@DiamonDinoia merged commit 73412b4 into flatironinstitute:master Feb 11, 2025
146 checks passed
@DiamonDinoia deleted the removing-alloca branch April 22, 2025 00:37