Port AtomicLocal integration to GPU #1163
base: master
Conversation
mfherbst left a comment
Thanks for this suggestion. I think it becomes increasingly clear that local_potential_fourier(element, p) is not the right interface. I think it's better to fix that rather than adding yet another level of indirection with the atomic_local_inner_loop!.
What I have in mind could be an interface where one gets multiple ps at once or something similar, perhaps in combination with an in-place version, where one is supposed to place it into CPU or GPU memory. That allows specialisation directly at the level of the PspUpf implementation.
What I really do not like here is that gpu/local.jl now essentially contains bled-out details of how one is supposed to integrate a PspUpf. Such code should be directly associated with the PspUpf structure and nothing else.
What do you think @abussy?
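A minimal sketch of what such a batched, in-place interface could look like (the name `local_potential_fourier!`, its signature, and the generic fallback below are assumptions for illustration, not DFTK's actual API):

```julia
# Hypothetical batched, in-place variant of local_potential_fourier: the caller
# hands over all p values at once plus a preallocated output buffer, which may
# be a CPU Array or a GPU array.
function local_potential_fourier!(out::AbstractVector, element, ps::AbstractVector)
    # Generic fallback: reuse the existing scalar routine elementwise;
    # broadcasting keeps this working for plain CPU arrays.
    out .= local_potential_fourier.(Ref(element), ps)
    out
end

# An ElementPsp{<:PspUpf} (or a GPU array type for `out`/`ps`) could then
# specialise this method, e.g. performing all radial quadratures in one kernel,
# keeping the UPF integration details next to the PspUpf implementation.
```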
src/gpu/local.jl
ints_cpu = to_cpu(ints)
This is weird. The only reason this is on the CPU is because we needed it for something. Now essentially you put form_factors on the CPU only to put it on the GPU once this function call is over, right? Can this not directly be a GPU array in the GPU version of the code?
I fully agree. The proposed solution happens to be the one with the least amount of change to the current code, but some change would allow a much smoother integration. As you suggest, allowing …
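To illustrate the "directly a GPU array" idea (a sketch only: `radial_integral` and the variable names are placeholders, not DFTK code):

```julia
using CUDA  # assuming a CUDA backend; other GPU array types work the same way

# Compute the integrals straight into a device array instead of materialising
# them on the CPU with to_cpu and copying the result back afterwards.
p_norms = CuArray(unique_p_norms)     # |p| values, already on the device
ints    = similar(p_norms)            # device buffer for the results
map!(radial_integral, ints, p_norms)  # one radial integral per |p|, on the GPU
# `ints` (and the form factors derived from it) never leaves the GPU.
```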
Could we fix this on the CPU somehow? I wonder what exactly makes them so slow. In principle it should only be twice as slow.
There is a whole machinery to only perform integrations for unique values of …
This last commit is a proof of concept and not a final product. It illustrates how one might implement the loop over … To make it general, one needs to: …
This approach has the advantage of having a single code base running on both CPU and GPU, without overcomplicating the high-level calls. However, the low-level implementation of the integrals becomes a bit awkward. Before proceeding, I'd like to make sure we agree on the way forward.
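To make the "single code base" idea concrete, here is a rough sketch (only the name `atomic_local_inner_loop!` comes from the thread; the quadrature and signatures below are assumptions, with prefactors and the p·r → 0 limit ignored):

```julia
# Scalar kernel: one radial integration for a single |p|, written purely in
# terms of scalars and indexable arrays so it compiles both on the CPU and
# inside a GPU kernel.
function local_integrand(p, r, vloc_r, dr)
    acc = zero(p)
    @inbounds for i in eachindex(r)
        x = p * r[i]
        acc += r[i]^2 * vloc_r[i] * sin(x) / x * dr[i]
    end
    acc
end

# Generic driver: map! lowers to a GPU kernel for device arrays and to a plain
# loop for Arrays, so the high-level call needs no GPU special case (the
# captured grid arrays are adapted to the device by the GPU array package).
function atomic_local_inner_loop!(ints, p_norms, r, vloc_r, dr)
    map!(p -> local_integrand(p, r, vloc_r, dr), ints, p_norms)
end
```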
This PR ports the integration of the `AtomicLocal` form factors to the GPU.

In standard calculations, the instantiation of the `AtomicLocal` term in the `PlaneWaveBasis` is generally negligible. However, when ForwardDiff Duals are involved, this is another story. In particular, for stress calculations on the GPU (not yet generally available, see JuliaMolSim/DftFunctionals.jl#23), this step becomes vastly more expensive than anything else.

For efficiency, the form factors are only calculated for a set of `G` vector norms, rather than for each individual `G`. With ForwardDiff Duals, two `G` vectors with the same value might differ in their partials and hence have different norms. As a result, a lot more integrations must take place.
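As a toy illustration of why the deduplication degrades (not DFTK code):

```julia
using ForwardDiff: Dual, value, partials
using LinearAlgebra: norm

# Two G vectors with identical values but different derivative components.
G1 = [Dual(1.0, 1.0), Dual(0.0, 0.0), Dual(0.0, 0.0)]
G2 = [Dual(1.0, 0.0), Dual(0.0, 1.0), Dual(0.0, 0.0)]

value(norm(G1)) == value(norm(G2))        # true: identical plain norms
partials(norm(G1)) == partials(norm(G2))  # false: the Dual norms differ
# Deduplicating on the full Dual norms therefore yields many more unique
# entries than deduplicating on Float64 norms, hence many more integrations.
```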
This PR introduces a special case for the GPU, where a `map!` kernel is launched over the `G` vector norms and each GPU thread takes care of an integration on the atomic grid. This only triggers in the special case of UPF pseudos on the GPU.

Notes:

- … the `PlaneWaveBasis`, its impact there will be smaller
- The `AtomicLocal` term will remain expensive on the CPU. Since everything else also gets more expensive, it is less of an obvious bottleneck in that case. However, I think it would be trivial to parallelize it over Julia threads
- Because the `ElementPsp{<:PspUpf}` type is far from being `isbits`, I did not manage to come up with a single code path that runs on both CPU and GPU (see the sketch below)
- The `Xc` term suffers from a similar problem, to be treated in a separate PR
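On the `isbits` point above, a hedged sketch of how the GPU special case could stay self-contained: only the plain radial arrays of the pseudopotential are moved to the device and captured by the kernel, never the `ElementPsp{<:PspUpf}` struct itself. The field names (`rgrid`, `vloc`, `drgrid`) and the helper `local_integrand` (sketched earlier in the thread) are assumptions for illustration:

```julia
using CUDA

# Hypothetical GPU-only path, dispatching on the device array type.
function atomic_local_form_factors_gpu(element, p_norms::CuArray)
    psp    = element.psp
    r      = CuArray(psp.rgrid)   # radial grid (assumed field name)
    vloc_r = CuArray(psp.vloc)    # local potential on that grid (assumed)
    dr     = CuArray(psp.drgrid)  # integration weights (assumed)
    ints   = similar(p_norms)
    # One GPU thread per |p|, each performing its quadrature on the atomic grid.
    map!(p -> local_integrand(p, r, vloc_r, dr), ints, p_norms)
    ints
end

# The CPU code path keeps the existing per-norm integration; this method only
# applies when the |p| values live in a device array, i.e. for UPF pseudos on GPU.
```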