Removing alloca on GPU #618
Conversation
Hello @DiamonDinoia! Interesting stuff :) PS: @Lenoush's contract ended, so she does not work on NUFFTs anymore.
blackwer left a comment
Looks good to me. A few stale comments should be updated, and then I'd be happy to merge. Marking as approved since they're not really critical.
Any idea why the constants are different around the last two digits?
They are computed up to eps. The other digits are noise.
Just wondering if you had insight into why it wasn't deterministic. Maybe a different thread level for BLAS? MATLAB internals changing vectorization levels? It's just weird -- not really concerning.
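(For readers wondering how eps-level noise like this arises: floating-point addition is not associative, so anything that reorders a reduction between runs, such as a threaded BLAS, changes the last digits. A minimal self-contained sketch, unrelated to this PR's code:)

```cpp
#include <cstdio>
#include <vector>

// Summing the same values in two different orders: the results typically
// agree to ~15 significant digits and differ only in the trailing ones,
// which is exactly the kind of run-to-run noise discussed above.
int main() {
  std::vector<double> v;
  for (int i = 0; i < 1000; ++i) v.push_back(1.0 / (i + 1));
  double fwd = 0.0, bwd = 0.0;
  for (int i = 0; i < 1000; ++i) fwd += v[i];    // forward order
  for (int i = 999; i >= 0; --i) bwd += v[i];    // reverse order
  std::printf("%.17g\n%.17g\n", fwd, bwd);
}
```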
```
Methods available:
  (1) Non-uniform points driven
  (2) Subproblem
```
Maybe get rid of the unsupported "subproblem" method comment while we're touching this code. We're already refactoring some of this stuff, so it seems like a good time to bring the comments up to date.
```
#else
  T ker1[MAX_NSPREAD];
#endif
  T ker1[ns];
```
Remove stale comment
```
#else
  T ker1[MAX_NSPREAD];
#endif
  T ker1[ns];
```
Remove stale comment
```
d_plan->opts.gpu_binsizey, d_plan->opts.gpu_binsizez);

if (d_plan->opts.gpu_kerevalmeth) {
  if (const auto finufft_err =
```
Nice catch on factoring this out
Hey, I will perhaps get some time on this tomorrow. Can you give some context on what exactly the change here is, and against what version would you prefer us to benchmark it? So sorry for the delay.
Referring to this discussion: when building cuFINUFFT natively on your machine, you observed a performance regression but lower memory utilization. This was caused by the use of dynamic stack allocation for the kernel tensors. Previously, the approach was to allocate MAX_ARRAY on the stack; however, since the GPU stack is relatively small, this would spill over into global memory, leading to higher memory utilization.

In my benchmarks, I did not observe a performance regression when using alloca (dynamic stack allocation), which differs from your experience. To get the best of both worlds, I am now avoiding alloca and instead using a template recursion trick to generate different CUDA kernels for the various spreading-width values. This approach eliminates the need for alloca while ensuring that no more memory is used than necessary.

Before integrating this change, I'd like to see how it impacts your benchmarks. Let me know your thoughts! Cheers,
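(For readers following along, a minimal sketch of the idea being described, with hypothetical names rather than the actual cuFINUFFT code: making the spreading width a template parameter gives each kernel instantiation a fixed-size array of exactly the right size, with no alloca and no MAX_NSPREAD over-allocation.)

```cuda
// Hypothetical sketch, not the actual cuFINUFFT code: the spreading width ns
// is a compile-time template parameter, so ker1 is a true fixed-size array.
template <typename T, int ns>
__global__ void spread_1d(const T *x, const T *c, T *fw, int M) {
  T ker1[ns]; // exactly ns elements: no alloca, no MAX_NSPREAD waste
  // ... evaluate the window function into ker1 and spread c onto fw ...
}
```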
janden left a comment
Looks good as far as I'm concerned, although I confess some of this std::forward is a bit over my head.
Just think of it as an r-value cast.
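(For anyone else puzzling over it, a small self-contained illustration of that "r-value cast" intuition; this is not code from this PR:)

```cpp
#include <cstdio>
#include <string>
#include <utility>

void sink(std::string &&) { std::puts("rvalue"); }
void sink(const std::string &) { std::puts("lvalue"); }

template <typename T>
void wrapper(T &&arg) {
  // std::forward restores the value category the caller used, so the
  // matching overload of sink is selected; a plain sink(arg) would
  // always pass an lvalue.
  sink(std::forward<T>(arg));
}

int main() {
  std::string s = "x";
  wrapper(s);            // prints "lvalue"
  wrapper(std::move(s)); // prints "rvalue"
}
```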
Merging for now. If @chaithyagr finds an issue, I'll open a separate issue or revert.
@paquiteau and @Lenoush have noticed that alloca made things slower in their benchmarks while greatly reducing memory consumption.
Details are in #570 and mind-inria/mri-nufft-benchmark#5
Instead of using opts.gpu_* to switch back to the old implementation, it is better to use kernel dispatch and have pre-compiled kernels for the various scenarios, as in the CPU code. That is one less parameter for the user to worry about, and it achieves both higher performance and low memory consumption at the same time (a sketch of such a dispatch follows below).
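(A hedged sketch of what such a dispatch can look like, building on the hypothetical spread_1d sketch above; identifiers are illustrative, and MAX_NSPREAD is 16 in FINUFFT:)

```cuda
// Hypothetical dispatcher sketch, not the actual implementation: compile-time
// recursion instantiates the kernel for every supported width, and the
// runtime value of ns picks which pre-compiled instantiation to launch.
template <typename T, int ns = 16 /* MAX_NSPREAD */>
void launch_spread_1d(int runtime_ns, const T *x, const T *c, T *fw, int M,
                      dim3 grid, dim3 block) {
  if constexpr (ns >= 2) {
    if (runtime_ns == ns)
      spread_1d<T, ns><<<grid, block>>>(x, c, fw, M);
    else
      launch_spread_1d<T, ns - 1>(runtime_ns, x, c, fw, M, grid, block);
  }
}
```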
@paquiteau, @Lenoush can you benchmark this branch and let us know how it fares? I could not measure a meaningful difference with my custom code.