
Conversation

@abussy (Collaborator) commented Oct 7, 2025

This PR ports the integration of the AtomicLocal form factors to the GPU.

In standard calculations, the cost of instantiating the AtomicLocal term in the PlaneWaveBasis is generally negligible. However, when ForwardDiff Duals are involved, this is another story. In particular, for stress calculations on the GPU (not yet generally available, see JuliaMolSim/DftFunctionals.jl#23), this step becomes vastly more expensive than anything else.

For efficiency, the form factors are only calculated for the set of unique G-vector norms, rather than for each individual G. With ForwardDiff Duals, two G vectors with the same plain value may differ in their partials and therefore have distinct norms. As a result, many more integrations must take place.

This PR introduces a special case for the GPU, where a map! kernel is launched over the G-vector norms and each GPU thread performs one integration on the atomic radial grid. This path only triggers for UPF pseudopotentials on the GPU.
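
To illustrate the pattern (a schematic sketch with made-up names and a simplified integrand, not the code in this PR): each |G| value gets an independent quadrature on the atomic radial grid, and on the GPU the loop over norms becomes a map! with one thread per norm.

```julia
# Schematic sketch only (illustrative names, simplified integrand), not the code in this PR.
# `rgrid` and `weights` describe the atomic radial grid, `f(p, r)` the radial integrand
# of the local form factor evaluated at norm p.
function radial_integral(p, rgrid, weights, f)
    acc = zero(p)
    @inbounds for i in eachindex(rgrid)
        acc += f(p, rgrid[i]) * weights[i]
    end
    acc
end

# CPU: a plain map over the (unique) G-vector norms.
form_factors(p_norms, rgrid, weights, f) =
    map(p -> radial_integral(p, rgrid, weights, f), p_norms)

# GPU: the same operation written as map! over device arrays, so that each GPU thread
# performs one radial integration (assuming the captured grid data is device-compatible).
function form_factors!(out, p_norms, rgrid, weights, f)
    map!(p -> radial_integral(p, rgrid, weights, f), out, p_norms)
    out
end
```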

Notes:

  • This PR will drastically speed up stress calculations on the GPU. While the new function will also be used when instantiating a standard PlaneWaveBasis, its impact there will be smaller.
  • The AtomicLocal term will remain expensive on the CPU. Since everything else also gets more expensive there, it is less of an obvious bottleneck in that case. However, I think it would be trivial to parallelize it over Julia threads.
  • Due to the complex dispatching taking place, and the fact that the ElementPsp{<:PspUpf} type is far from being isbits, I did not manage to come up with a single code path that runs on both CPU and GPU.
  • The instantiation of the Xc term suffers from a similar problem, which will be treated in a separate PR.

@mfherbst (Member) left a comment

Thanks for this suggestion. I think it becomes increasingly clear that local_potential_fourier(element, p) is not the right interface. It is better to fix that rather than to add yet another level of indirection with atomic_local_inner_loop!.

What I have in mind could be an interface where one gets multiple ps at once, or something similar, perhaps in combination with an in-place version where one is supposed to place the result directly into CPU or GPU memory. That would allow specialisation directly at the level of the PspUpf implementation.
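
For concreteness, something roughly along these lines (purely a sketch for the discussion; names and signatures are not final, and the broadcast fallback is only meant for the CPU case):

```julia
# Purely a sketch for discussion, not a final signature.
# Generic fallback: evaluate the existing scalar routine for all ps at once, in place.
# This covers the CPU case; since ElementPsp{<:PspUpf} is not isbits, the GPU case would
# come from a specialised method at the PspUpf level, which could dispatch on the array
# type of `ps` and own the radial-quadrature details.
function local_potential_fourier!(out::AbstractArray, element, ps::AbstractArray)
    out .= local_potential_fourier.(Ref(element), ps)
    out
end
```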

What I really do not like here is that gpu/local.jl now essentially contains bled-out details of how one is supposed to integrate a PspUpf. Such code should be directly associated with the PspUpf structure and nothing else.

What do you think, @abussy?

src/gpu/local.jl (outdated diff):
end
end

ints_cpu = to_cpu(ints)
@mfherbst (Member) commented:

This is weird. The only reason this is on the CPU is that we needed it there for something. Now you essentially put form_factors on the CPU only to move it back to the GPU once this function call is over, right? Can this not directly be a GPU array in the GPU version of the code?

@abussy (Collaborator, Author) commented Oct 8, 2025

> What I really do not like here is that gpu/local.jl now essentially contains bled-out details of how one is supposed to integrate a PspUpf. Such code should be directly associated with the PspUpf structure and nothing else.

I fully agree. The proposed solution happens to be the one requiring the least amount of change to the current code, but a more invasive change would allow for a much smoother integration.

As you suggest, allowing local_potential_fourier(element, p) to treat multiple ps at once would push the loop logic down to the specific pseudopotential implementation. If we remain general and pass an AbstractArray of ps, we might even be able to write architecture-agnostic code. I'll look into it.
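
As a toy illustration of why this can be architecture agnostic (nothing DFTK-specific; `compute_all` is a made-up helper), generic array code selects its backend purely through the type of the array it is given:

```julia
# Toy example, independent of DFTK: the same generic code runs on CPU or GPU arrays,
# with the backend selected by dispatch on the array type.
compute_all(ps::AbstractArray, f) = map(f, ps)

ps_cpu = rand(100)
compute_all(ps_cpu, p -> sin(p) / p)              # plain Array in, plain Array out

# With a GPU package loaded (e.g. CUDA.jl), the very same call launches a GPU kernel:
# using CUDA
# compute_all(CuArray(ps_cpu), p -> sin(p) / p)   # CuArray in, CuArray out
```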

@Technici4n (Collaborator) commented:

> In standard calculations, the cost of instantiating the AtomicLocal term in the PlaneWaveBasis is generally negligible. However, when ForwardDiff Duals are involved, this is another story.

Could we fix this on the CPU somehow? I wonder what exactly makes them so slow. In principle it should only be twice as slow.

@abussy (Collaborator, Author) commented Oct 8, 2025

> I wonder what exactly makes them so slow. In principle it should only be twice as slow.

There is a whole machinery to only perform the integrations for unique values of norm(G) (see here). When Duals are involved, the number of unique norms explodes. My interpretation is that G vectors with the same plain value can have different partials, so their norms end up being different.
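
Here is a toy example of the effect (independent of DFTK): two G vectors of equal length whose norms have different derivatives with respect to a strain-like parameter, so that as Duals they no longer count as a single unique norm.

```julia
using ForwardDiff, LinearAlgebra, StaticArrays

# Differentiate |(I + ε E) g| with respect to a strain-like parameter ε for two
# reciprocal vectors of equal length. The plain values coincide, but the derivatives
# (the Dual partials) differ, so the two norms can no longer be deduplicated.
E  = @SMatrix [1.0 0.0 0.0; 0.0 0.0 0.0; 0.0 0.0 0.0]  # perturbation along x only
g1 = @SVector [1.0, 0.0, 0.0]
g2 = @SVector [0.0, 1.0, 0.0]

d1 = ForwardDiff.derivative(ε -> norm((I + ε * E) * g1), 0.0)  # 1.0: |G| grows with x-strain
d2 = ForwardDiff.derivative(ε -> norm((I + ε * E) * g2), 0.0)  # 0.0: unaffected
norm(g1) == norm(g2)  # true:  identical lengths without derivatives
d1 == d2              # false: different sensitivities, hence distinct Dual norms
```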

@abussy (Collaborator, Author) commented Oct 8, 2025

This last commit is a proof of concept and not a final product. It illustrates how one might implement the loop over norm(G) at the Psp level (here only implemented for <:PspUpf).

To make it general, one needs to:

  • Modify all function signatures in element.jl so that they expect arrays, and make sure the underlying implementations match
  • Update the rest of the code to this new convention

This approach has the advantage of a single code base running on both CPU and GPU, without overcomplicating the high-level calls. However, the low-level implementation of the integrals becomes a bit awkward.

Before proceeding, I'd like to make sure we agree on the way forward.
