Port stress/response calculations to the GPU #1187
Conversation
Merged master. Adapted tests to the refactoring brought by #1182. Additionally, removed the problematic bit of code discussed earlier. It turns out that the comparison of Duals does not take place on the GPU, as long as all XC operations are done on the CPU. This might become a concern again in the future.
import ForwardDiff
import ForwardDiff: Dual
Can also be removed, right ?
src/gpu/gpu_arrays.jl (outdated)
# Make sure that computations done by DftFunctionals.jl are done on the CPU (until refactoring)
for fun in (:potential_terms, :kernel_terms)
    @eval function DftFunctionals.$fun(fun::DispatchFunctional, ρ::AT,
                                       args...) where {AT <: AbstractGPUArray}
        # Fallback implementation for the GPU: Transfer to the CPU and run computation there
        cpuify(::Nothing) = nothing
        cpuify(x::AbstractArray) = Array(x)
        $fun(fun, Array(ρ), cpuify.(args)...)
    end
If we have this, do we need the above version at all (which is simply more specific to Float64), right?
Right, we probably only need the most general workaround. The thought process was that, once DftFunctionals.jl is refactored to run on the GPU, we can simply remove the AbstractGPUArray definition.
It's probably best to keep the code in the cleanest state in the meantime though; I'll see to it.
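For reference, a minimal standalone sketch of the same transfer-to-CPU fallback pattern. JLArrays (the CPU-backed reference implementation of AbstractGPUArray) and the cpu_kernel/gpu_fallback names are assumptions here, used only so the snippet runs without a GPU:

using GPUArrays, JLArrays

cpu_kernel(ρ::Array, args...) = sum(ρ)   # hypothetical CPU-only routine

function gpu_fallback(ρ::AbstractGPUArray, args...)
    cpuify(::Nothing) = nothing
    cpuify(x::AbstractArray) = Array(x)
    cpu_kernel(Array(ρ), cpuify.(args)...)   # transfer everything to the CPU, then compute there
end

gpu_fallback(JLArray(rand(4)))   # the computation runs on the CPU copy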
src/workarounds/forwarddiff_rules.jl (outdated)
Complex(Dual{T}(real(ψnk), real.(δψnk)),
        Dual{T}(imag(ψnk), imag.(δψnk)))
Is this the right thing to do? Does this not lose the tags?
@niklasschmitz @Technici4n may have better insights here.
Yes, that seems to lose the tags.
But isn't the T of Dual{T,V,N} the tag? I am confused here.
Oh right, T is the tag and not the value type. Seems fine then; this is just making the T, V, N explicit instead of letting the compiler infer them. I am surprised that this is required though. Would the root problem be the type-unstable nature of construct_value(basis)?
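For context, a minimal standalone sketch (the :mytag tag and the values are made up) showing that an explicit Dual{T}(value, partials) constructor keeps the tag T and only re-infers V and N:

using ForwardDiff: Dual, value, partials

d = Dual{:mytag}(1.0, 2.0)                          # tag :mytag, value 1.0, one partial
z = Complex(Dual{:mytag}(value(d), partials(d)),    # real part rebuilt with the same tag
            Dual{:mytag}(0.0, 0.0))                 # imaginary part with the same tag
real(z) isa Dual{:mytag}                            # true: the tag survived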
Would
Complex(Dual{T,V,N}(real(ψnk), real.(δψnk)),
Dual{T,V,N}(imag(ψnk), imag.(δψnk)))
solve the issue?
Also, what about here? Don't we have a similar situation?
DFTK.jl/src/workarounds/forwarddiff_rules.jl, lines 321 to 326 (at 3551ad3):
function LinearAlgebra.norm(x::SVector{S,<:Dual{Tg,T,N}}) where {S,Tg,T,N}
    x_value = ForwardDiff.value.(x)
    y = norm(x_value)
    dy = ntuple(j->real(dot(x_value, ForwardDiff.partials.(x,j))) * pinv(y), N)
    Dual{Tg}(y, dy)
end
Ah no, I mean that Dual{T} is fine. I misread it the first time. What is weird to me is why this is necessary, given that the GPU compiler isn't operating directly on this method? (Or is it?)
Looking at @code_typed might be quite insightful.
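For instance, something along these lines (the concrete Dual arguments are just placeholders) shows what the compiler infers for the explicit construction:

using InteractiveUtils   # for @code_typed
using ForwardDiff: Dual

@code_typed Complex(Dual{:t}(1.0, 2.0), Dual{:t}(3.0, 4.0))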
Sorry, I didn't see @Technici4n's last message when I wrote the above.
The GPU compiler requires this to be able to compile. I am not sure what the root cause is, but type instability is a likely candidate. Generally, the GPU compiler is rather bad at type inference.
end

# Ensure functionals from DftFunctionals are sent to the CPU, until DftFunctionals.jl is refactored
function DftFunctionals.$fun(fun::DftFunctionals.Functional, density::LibxcDensities)
This looks weird to me. Why is this needed on top of the above? Should the types not somehow depend on a GPU type here?
The tricky bit is to avoid ambiguity with the internal definitions of DftFunctionals.jl.
Adding the following to src/gpu/gpu_arrays.jl leads to ambiguity, because it does not specialize on the type of functional (:lda, :gga, or :mgga):
for fun in (:potential_terms, :kernel_terms)
    @eval function DftFunctionals.$fun(fun::DftFunctionals.Functional, ρ::AT,
                                       args...) where {AT <: AbstractGPUArray}
        # Fallback implementation for the GPU: Transfer to the CPU and run computation there
        cpuify(::Nothing) = nothing
        cpuify(x::AbstractArray) = Array(x)
        $fun(fun, Array(ρ), cpuify.(args)...)
    end
end
Either I write the above for each functional type (a lot of code duplication), or I parametrize it with a second loop over the functional types, e.g.:
for fun in (:potential_terms, :kernel_terms), ftype in (:lda, :gga, :mgga)
    @eval function DftFunctionals.$fun(fun::DftFunctionals.Functional{$(QuoteNode(ftype))},
                                       ρ::AT, args...) where {AT <: AbstractGPUArray}
        # Fallback implementation for the GPU: Transfer to the CPU and run computation there
        cpuify(::Nothing) = nothing
        cpuify(x::AbstractArray) = Array(x)
        $fun(fun, Array(ρ), cpuify.(args)...)
    end
end
I don't like either of these alternatives very much. I think the current solution carries a very clear message: anything DftFunctionals-related goes to the CPU.
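As a toy illustration of the ambiguity (all names here are made up, not DftFunctionals.jl API): one method is more specific in the functional argument, the other in the array argument, so Julia cannot pick a winner.

abstract type Functional{Family} end
struct MyLDA <: Functional{:lda} end

f(::Functional{:lda}, x::AbstractArray) = "library method, more specific in the functional"
f(::Functional,       x::Vector)        = "GPU-style fallback, more specific in the array"

# f(MyLDA(), [1.0])  # MethodError: ambiguous, since neither method dominates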
copyto!(y, _mul(p, x))
end
function Base.:*(p::AbstractFFTs.Plan, x::AbstractArray{<:Complex{<:Dual{Tg}}}) where {Tg}
function _mul(p::AbstractFFTs.Plan, x::AbstractArray{<:Complex{<:Dual{Tg}}}) where {Tg}
Again this feels strange and is surprising to me. Why did you need this?
Without this workaround, the GPU compiler throws an invalid LLVM IR error during stress calculations. I think there is confusion around which method of Base.:* to use, but I don't understand why.
@@ -1,492 +1,66 @@
# Wrappers around ForwardDiff to fix tags and reduce compilation time.
I like the general idea to make these tests more generic, but I don't like the split-up, because this way it's one more level of indirection I have to follow to see the actual test code (from the file and line information printed in the test suite output).
I recall I wrote test functions elsewhere that internally create multiple @testcase instances: you call just one function for the CPU tests and it internally creates the "templated" test cases with the CPU architecture. Would that be an option here? If not, reducing code duplication clearly wins and this is likely a good option.
Also, for the cases that do not depend on the architecture, I'd avoid the indirection altogether and just keep the test code here.
But I have to think about this a bit more, to see if I can come up with ideas to improve the cut between the multiple files.
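A rough sketch of that pattern (function and testset names are hypothetical): a single entry point that internally creates the architecture-specific test sets.

using Test

function run_forwarddiff_tests(architecture; label=string(architecture))
    @testset "stresses ($label)" begin
        # ... actual test body, parametrized by `architecture` ...
        @test true
    end
    @testset "response ($label)" begin
        @test true
    end
end

run_forwarddiff_tests(Array)       # CPU run
# run_forwarddiff_tests(CuArray)   # GPU run, when a GPU is available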
This PR enables ForwardDiff calculations (stress and response) on the GPU. Main changes are:
- adaptations of the ForwardDiff workarounds (src/workarounds/forwarddiff_rules.jl)
- computations involving the DftFunctionals.jl package are transferred to the CPU, until that package is refactored

With this PR, all ForwardDiff workflows currently tested on the CPU successfully run on both NVIDIA and AMD GPUs.

Future improvements will come with:
- PlaneWaveBasis instantiation
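A sketch of the kind of workflow this PR enables. The silicon setup and keyword values are placeholders taken from DFTK's documented examples, and the exact API may differ between versions:

using DFTK
using CUDA

a = 10.26  # silicon lattice constant in Bohr
lattice = a / 2 * [[0 1 1.]; [1 0 1.]; [1 1 0.]]
Si = ElementPsp(:Si; psp=load_psp("hgh/lda/si-q4"))
atoms = [Si, Si]
positions = [ones(3)/8, -ones(3)/8]

model = model_DFT(lattice, atoms, positions; functionals=LDA())
basis = PlaneWaveBasis(model; Ecut=20, kgrid=(2, 2, 2),
                       architecture=DFTK.GPU(CuArray))
scfres = self_consistent_field(basis; tol=1e-8)

# ForwardDiff-based quantity that this PR makes runnable on the GPU:
stresses = compute_stresses_cart(scfres)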