
Conversation

@abussy (Collaborator) commented Oct 7, 2025

This PR ports the integration of the AtomicLocal form factors to the GPU.

In standard calculations, the cost of instantiating the AtomicLocal term in the PlaneWaveBasis is generally negligible. However, when ForwardDiff Duals are involved, this is another story. In particular, for stress calculations on the GPU (not yet generally available, see JuliaMolSim/DftFunctionals.jl#23), this step becomes vastly more expensive than anything else.

For efficiency, the form factors are only calculated for the set of unique G-vector norms, rather than for each individual G. With ForwardDiff Duals, two G vectors with the same plain value may differ in their partials and therefore have distinct norms. As a result, many more integrations must take place.

This PR introduces a special case for the GPU, where a map! kernel is launched over the G-vector norms and each GPU thread performs one integration on the atomic radial grid. This path only triggers for UPF pseudopotentials on the GPU.
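
To illustrate the pattern (a schematic sketch with made-up names and a simplified integrand, not the code in this PR): each |G| value gets an independent quadrature on the atomic radial grid, and on the GPU the loop over norms becomes a map! with one thread per norm.

```julia
# Schematic sketch only (illustrative names, simplified integrand), not the code in this PR.
# `rgrid` and `weights` describe the atomic radial grid, `f(p, r)` the radial integrand
# of the local form factor evaluated at norm p.
function radial_integral(p, rgrid, weights, f)
    acc = zero(p)
    @inbounds for i in eachindex(rgrid)
        acc += f(p, rgrid[i]) * weights[i]
    end
    acc
end

# CPU: a plain map over the (unique) G-vector norms.
form_factors(p_norms, rgrid, weights, f) =
    map(p -> radial_integral(p, rgrid, weights, f), p_norms)

# GPU: the same operation written as map! over device arrays, so that each GPU thread
# performs one radial integration (assuming the captured grid data is device-compatible).
function form_factors!(out, p_norms, rgrid, weights, f)
    map!(p -> radial_integral(p, rgrid, weights, f), out, p_norms)
    out
end
```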

Notes:

  • This PR will drastically speed up stress calculations on the GPU. While the new function will also be used when instantiating a standard PlaneWaveBasis, its impact there will be smaller.
  • The AtomicLocal term will remain expensive on the CPU. Since everything else also gets more expensive there, it is less of an obvious bottleneck in that case. However, I think it would be trivial to parallelize it over Julia threads.
  • Due to the complex dispatching taking place, and the fact that the ElementPsp{<:PspUpf} type is far from being isbits, I did not manage to come up with a single code path that runs on both CPU and GPU.
  • The instantiation of the Xc term suffers from a similar problem, which will be treated in a separate PR.

@mfherbst (Member) left a comment

Thanks for this suggestion. I think it becomes increasingly clear that local_potential_fourier(element, p) is not the right interface. It is better to fix that rather than to add yet another level of indirection with atomic_local_inner_loop!.

What I have in mind could be an interface where one gets multiple ps at once, or something similar, perhaps in combination with an in-place version where one is supposed to place the result directly into CPU or GPU memory. That would allow specialisation directly at the level of the PspUpf implementation.
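
For concreteness, something roughly along these lines (purely a sketch for the discussion; names and signatures are not final, and the broadcast fallback is only meant for the CPU case):

```julia
# Purely a sketch for discussion, not a final signature.
# Generic fallback: evaluate the existing scalar routine for all ps at once, in place.
# This covers the CPU case; since ElementPsp{<:PspUpf} is not isbits, the GPU case would
# come from a specialised method at the PspUpf level, which could dispatch on the array
# type of `ps` and own the radial-quadrature details.
function local_potential_fourier!(out::AbstractArray, element, ps::AbstractArray)
    out .= local_potential_fourier.(Ref(element), ps)
    out
end
```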

What I really do not like here is that gpu/local.jl now essentially contains bled-out details of how one is supposed to integrate a PspUpf. Such code should be directly associated with the PspUpf structure and nothing else.

What do you think, @abussy?

src/gpu/local.jl (outdated diff):
end
end

ints_cpu = to_cpu(ints)
@mfherbst (Member) commented:

This is weird. The only reason this is on the CPU is that we needed it there for something. Now you essentially put form_factors on the CPU only to move it back to the GPU once this function call is over, right? Can this not directly be a GPU array in the GPU version of the code?

@abussy (Collaborator, Author) commented Oct 8, 2025

> What I really do not like here is that gpu/local.jl now essentially contains bled-out details of how one is supposed to integrate a PspUpf. Such code should be directly associated with the PspUpf structure and nothing else.

I fully agree. The proposed solution happens to be the one requiring the least amount of change to the current code, but a more invasive change would allow for a much smoother integration.

As you suggest, allowing local_potential_fourier(element, p) to treat multiple ps at once would push the loop logic down to the specific pseudopotential implementation. If we remain general and pass an AbstractArray of ps, we might even be able to write architecture-agnostic code. I'll look into it.
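
As a toy illustration of why this can be architecture agnostic (nothing DFTK-specific; `compute_all` is a made-up helper), generic array code selects its backend purely through the type of the array it is given:

```julia
# Toy example, independent of DFTK: the same generic code runs on CPU or GPU arrays,
# with the backend selected by dispatch on the array type.
compute_all(ps::AbstractArray, f) = map(f, ps)

ps_cpu = rand(100)
compute_all(ps_cpu, p -> sin(p) / p)              # plain Array in, plain Array out

# With a GPU package loaded (e.g. CUDA.jl), the very same call launches a GPU kernel:
# using CUDA
# compute_all(CuArray(ps_cpu), p -> sin(p) / p)   # CuArray in, CuArray out
```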

@Technici4n (Collaborator) commented:

> In standard calculations, the cost of instantiating the AtomicLocal term in the PlaneWaveBasis is generally negligible. However, when ForwardDiff Duals are involved, this is another story.

Could we fix this on the CPU somehow? I wonder what exactly makes them so slow. In principle it should only be twice as slow.

@abussy (Collaborator, Author) commented Oct 8, 2025

> I wonder what exactly makes them so slow. In principle it should only be twice as slow.

There is a whole machinery to only perform the integrations for unique values of norm(G) (see here). When Duals are involved, the number of unique norms explodes. My interpretation is that G vectors with the same plain value can have different partials, so their norms end up being different.
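
Here is a toy example of the effect (independent of DFTK): two G vectors of equal length whose norms have different derivatives with respect to a strain-like parameter, so that as Duals they no longer count as a single unique norm.

```julia
using ForwardDiff, LinearAlgebra, StaticArrays

# Differentiate |(I + ε E) g| with respect to a strain-like parameter ε for two
# reciprocal vectors of equal length. The plain values coincide, but the derivatives
# (the Dual partials) differ, so the two norms can no longer be deduplicated.
E  = @SMatrix [1.0 0.0 0.0; 0.0 0.0 0.0; 0.0 0.0 0.0]  # perturbation along x only
g1 = @SVector [1.0, 0.0, 0.0]
g2 = @SVector [0.0, 1.0, 0.0]

d1 = ForwardDiff.derivative(ε -> norm((I + ε * E) * g1), 0.0)  # 1.0: |G| grows with x-strain
d2 = ForwardDiff.derivative(ε -> norm((I + ε * E) * g2), 0.0)  # 0.0: unaffected
norm(g1) == norm(g2)  # true:  identical lengths without derivatives
d1 == d2              # false: different sensitivities, hence distinct Dual norms
```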

@abussy (Collaborator, Author) commented Oct 8, 2025

This last commit is a proof of concept and not a final product. It illustrates how one might implement the loop over norm(G) at the Psp level (here only implemented for <:PspUpf).

To make it general, one needs to:

  • Modify all function signatures in element.jl so that they expect arrays, and make sure the underlying implementations match
  • Update the rest of the code to this new convention

This approach has the advantage of a single code base running on both CPU and GPU, without overcomplicating the high-level calls. However, the low-level implementation of the integrals becomes a bit awkward.

Before proceeding, I'd like to make sure we agree on the way forward.
