
Conversation

@abussy (Collaborator) commented Nov 7, 2025

This PR enables ForwardDiff calculations (stress and response) on the GPU. Main changes are:

  1. Data transfer from/to the device where necessary
  2. Various small changes to avoid GPU compiler confusion (e.g. see changes in src/workarounds/forwarddiff_rules.jl)
  3. CPU fallbacks for all XC operations taking place in the DftFunctionals.jl package
  4. Refactoring of the ForwardDiff tests, such that all tests can be run on various architectures (CPU, CUDA, AMDGPU)

With this PR, all ForwardDiff workflows currently tested on the CPU successfully run on both NVIDIA and AMD GPUs.
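
To illustrate points 1 and 3 above, the generic shape of such a CPU fallback is to pull the data to the host, run the CPU implementation, and copy the result back. A minimal sketch (on_cpu is a hypothetical helper, not code from this PR):

using GPUArrays: AbstractGPUArray

# Hypothetical helper: run f on host data, then copy the result back to the
# device array family of x.
function on_cpu(f, x::AbstractGPUArray)
    y_cpu = f(Array(x))                                      # device -> host
    copyto!(similar(x, eltype(y_cpu), size(y_cpu)), y_cpu)   # host -> device
end
on_cpu(f, x::AbstractArray) = f(x)   # plain CPU arrays: nothing to transfer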

Future improvements will come with:

@abussy (Collaborator, Author) commented Nov 12, 2025

Merged master. Adapted tests to the refactoring brought by #1182.

Additionally, removed this problematic bit of code in ext/DFTKAMDGPUExt.jl:

# Enable comparisons of Duals on AMD GPUs
_val(x) = x
_val(x::Dual) = _val(ForwardDiff.value(x))
function Base.:<(x::Dual{T,V,N},
                 y::Dual{T,V,N}) where {T,V,N}
    _val(x) < _val(y)
end
function Base.:>(x::Dual{T,V,N},
                 y::Dual{T,V,N}) where {T,V,N}
    _val(x) > _val(y)
end

It turns out that comparison of Duals does not take place on the GPU, as long as all XC operations are done on the CPU. This might become a concern again in the future, once DftFunctionals.jl is refactored.

Comment on lines +7 to +8
import ForwardDiff
import ForwardDiff: Dual

Member

Can also be removed, right?

Comment on lines 27 to 35
# Make sure that computations done by DftFunctionals.jl are done on the CPU (until refactoring)
for fun in (:potential_terms, :kernel_terms)
    @eval function DftFunctionals.$fun(fun::DispatchFunctional, ρ::AT,
                                       args...) where {AT <: AbstractGPUArray}
        # Fallback implementation for the GPU: Transfer to the CPU and run computation there
        cpuify(::Nothing) = nothing
        cpuify(x::AbstractArray) = Array(x)
        $fun(fun, Array(ρ), cpuify.(args)...)
    end

Member

If we have this, do we need the above version at all? (It is simply more specific to Float64, right?)

Collaborator Author

Right, we probably only need the most general workaround. The thought process was that, once DftFunctionals.jl is refactored to run on the GPU, we can simply remove the AbstractGPUArray definition.

It's probably best to keep the code in the cleanest state in the meantime, though; I'll see to it.

Comment on lines 255 to 256
Complex(Dual{T}(real(ψnk), real.(δψnk)),
        Dual{T}(imag(ψnk), imag.(δψnk)))

Member

Is this the right thing to do? Does this not lose the tags?
@niklasschmitz @Technici4n may have better insights here.

Collaborator

Yes, that seems to lose the tags.

Collaborator Author

But isn't the T of Dual{T,V,N} the tag? I am confused here.

Collaborator

Oh right, T is the tag and not the value type. Seems fine then; this is just making the T, V, N explicit instead of letting the compiler infer them. I am surprised that this is required, though. Would the root problem be the type-unstable nature of construct_value(basis)?
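
(For reference, a quick REPL check outside this PR, confirming that the first parameter of Dual is indeed the tag:)

julia> using ForwardDiff: Dual

julia> typeof(Dual{:mytag}(1.0, 2.0))   # Dual{T}(value, partials...)
Dual{:mytag, Float64, 1}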

Collaborator Author

Would

Complex(Dual{T,V,N}(real(ψnk), real.(δψnk)),
        Dual{T,V,N}(imag(ψnk), imag.(δψnk)))

solve the issue?

Also, what about here? Don't we have a similar situation?

function LinearAlgebra.norm(x::SVector{S,<:Dual{Tg,T,N}}) where {S,Tg,T,N}
    x_value = ForwardDiff.value.(x)
    y = norm(x_value)
    dy = ntuple(j -> real(dot(x_value, ForwardDiff.partials.(x, j))) * pinv(y), N)
    Dual{Tg}(y, dy)
end

Collaborator

Ah no, I mean that Dual{T} is fine; I misread it the first time. What is weird to me is why this is necessary, given that the GPU compiler isn't operating directly on this method? (Or is it?)

Looking at @code_typed might be quite insightful.
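
(For instance, on a toy version of the construction in question; the closure is purely illustrative:)

using InteractiveUtils      # provides @code_typed
using ForwardDiff: Dual

# Inspect what the compiler infers for the explicit-tag construction:
@code_typed ((ψnk, δψnk) -> Complex(Dual{:t}(real(ψnk), real(δψnk)),
                                    Dual{:t}(imag(ψnk), imag(δψnk))))(1.0 + 2.0im, 0.5 + 0.5im)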

Collaborator Author

Sorry, I didn't see @Technici4n's last message when I wrote the above.

The GPU compiler requires this to be able to compile. I am not sure what the root cause is, but type instability is a likely candidate. Generally, the GPU compiler is rather bad at type inference.

end

# Ensure functionals from DftFunctionals are sent to the CPU, until DftFunctionals.jl is refactored
function DftFunctionals.$fun(fun::DftFunctionals.Functional, density::LibxcDensities)

Member

This looks weird to me. Why is this needed on top of the above? Should the types not somehow depend on a GPU type here?

@abussy (Collaborator, Author) commented Nov 13, 2025

The tricky bit is to avoid ambiguity with the internal definitions of DftFunctionals.jl.

Adding the following to src/gpu/gpu_array.jl leads to ambiguity, because it does not specialize on the type of functional (:lda, :gga, or :mgga):

for fun in (:potential_terms, :kernel_terms)
    @eval function DftFunctionals.$fun(fun::DftFunctionals.Functional, ρ::AT,
                                       args...) where {AT <: AbstractGPUArray}
        # Fallback implementation for the GPU: Transfer to the CPU and run computation there
        cpuify(::Nothing) = nothing
        cpuify(x::AbstractArray) = Array(x)
        $fun(fun, Array(ρ), cpuify.(args)...)
    end
end

Either I write the above for each functional type (a lot of code duplication), or I parametrize it with a second loop over functional types, e.g.:

for fun in (:potential_terms, :kernel_terms), ftype in (:lda, :gga, :mgga)
    @eval function DftFunctionals.$fun(fun::DftFunctionals.Functional{$(QuoteNode(ftype))},
                                       ρ::AT, args...) where {AT <: AbstractGPUArray}
        # Fallback implementation for the GPU: Transfer to the CPU and run computation there
        cpuify(::Nothing) = nothing
        cpuify(x::AbstractArray) = Array(x)
        $fun(fun, Array(ρ), cpuify.(args)...)
    end
end

I don't like any of these alternatives very much. I think the current solution carries a very clear message: anything DftFunctionals-related goes to the CPU.

copyto!(y, _mul(p, x))
end
-function Base.:*(p::AbstractFFTs.Plan, x::AbstractArray{<:Complex{<:Dual{Tg}}}) where {Tg}
+function _mul(p::AbstractFFTs.Plan, x::AbstractArray{<:Complex{<:Dual{Tg}}}) where {Tg}

Member

Again, this feels strange and is surprising to me. Why did you need this?

Collaborator Author

Without this workaround, the GPU compiler throws an invalid LLVM IR error during stress calculations. I think there is confusion around which method of Base.:* to use, but I don't understand why.
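
For reference, the shape of the workaround as I read the hunk above (a sketch only; the Dual-handling body is a placeholder): the implementation moves from Base.:* into an internal _mul, and the Base entry points become thin wrappers.

using LinearAlgebra
import AbstractFFTs
using ForwardDiff: Dual

# All Dual-specific logic lives in _mul (placeholder body in this sketch) ...
function _mul(p::AbstractFFTs.Plan, x::AbstractArray{<:Complex{<:Dual{Tg}}}) where {Tg}
    error("sketch: apply the plan to values and partials separately here")
end

# ... while * and mul! stay thin, leaving the GPU compiler a single obvious
# method of Base.:* to resolve.
Base.:*(p::AbstractFFTs.Plan, x::AbstractArray{<:Complex{<:Dual{Tg}}}) where {Tg} = _mul(p, x)
function LinearAlgebra.mul!(y::AbstractArray, p::AbstractFFTs.Plan,
                            x::AbstractArray{<:Complex{<:Dual{Tg}}}) where {Tg}
    copyto!(y, _mul(p, x))
end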

@@ -1,492 +1,66 @@
# Wrappers around ForwardDiff to fix tags and reduce compilation time.

@mfherbst (Member) commented Nov 12, 2025

I like the general idea to make these tests more generic, but I don't like the split-up, because this way it's one more level of indirection I have to follow to see the actual test code (from the file and line information printed in the test suite output).

I recall I wrote test functions elsewhere that internally create multiple @testcase instances: you call just one function for the CPU tests and it internally creates the "templated" test cases with the CPU architecture. Would that be an option here? If not, reducing code duplication clearly wins and this is likely a good option.
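
(Something in this direction, as a sketch with hypothetical names, using a plain @testset rather than our actual test infrastructure:)

using Test

# Hypothetical entry point: one call per architecture, internally creating
# the "templated" test sets, so the suite output still points to this file.
function forwarddiff_tests(architecture)
    @testset "ForwardDiff stress ($architecture)" begin
        # ... set up the model on `architecture` and run the actual checks ...
        @test true   # placeholder for the real assertions
    end
end

forwarddiff_tests(:CPU)   # GPU runners would call forwarddiff_tests(:CUDA), etc.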

Also, for the cases that do not depend on architecture, I'd avoid the indirection altogether and just keep the test code here.

But I have to think about this a bit more, to see if I can come up with ideas for improving the cut between the multiple files.
