Port stress/response calculations to the GPU #1187
Conversation
Merged master. Adapted tests to the refactoring brought by #1182. Additionally, removed the problematic bit of code discussed earlier. It turns out that the comparison of Duals does not take place on the GPU, as long as all XC operations are done on the CPU. This might become a concern again in the future.
import ForwardDiff
import ForwardDiff: Dual
Can also be removed, right ?
src/gpu/gpu_arrays.jl (outdated)
# Make sure that computations done by DftFunctionals.jl are done on the CPU (until refactoring)
for fun in (:potential_terms, :kernel_terms)
    @eval function DftFunctionals.$fun(fun::DispatchFunctional, ρ::AT,
                                       args...) where {AT <: AbstractGPUArray}
        # Fallback implementation for the GPU: Transfer to the CPU and run computation there
        cpuify(::Nothing) = nothing
        cpuify(x::AbstractArray) = Array(x)
        $fun(fun, Array(ρ), cpuify.(args)...)
    end
If we have this, do we need the above version at all (which is simply more specific to Float64), right?
Right, we probably only need the most general workaround. The thought process was that, once DftFunctionals.jl is refactored to run on the GPU, we can simply remove the AbstractGPUArray definition.
It's probably best to keep the code in the cleanest state in the meantime though; I'll see to it.
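For reference, a minimal standalone sketch of the same transfer-to-CPU fallback pattern. JLArrays (the CPU-backed reference implementation of AbstractGPUArray) and the cpu_kernel/gpu_fallback names are assumptions here, used only so the snippet runs without a GPU:

using GPUArrays, JLArrays

cpu_kernel(ρ::Array, args...) = sum(ρ)   # hypothetical CPU-only routine

function gpu_fallback(ρ::AbstractGPUArray, args...)
    cpuify(::Nothing) = nothing
    cpuify(x::AbstractArray) = Array(x)
    cpu_kernel(Array(ρ), cpuify.(args)...)   # transfer everything to the CPU, then compute there
end

gpu_fallback(JLArray(rand(4)))   # the computation runs on the CPU copy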
src/workarounds/forwarddiff_rules.jl (outdated)
Complex(Dual{T}(real(ψnk), real.(δψnk)),
        Dual{T}(imag(ψnk), imag.(δψnk)))
Is this the right thing to do? Does this not lose the tags?
@niklasschmitz @Technici4n may have better insights here.
Yes, that seems to lose the tags.
But isn't the T of Dual{T,V,N} the tag? I am confused here.
Oh right, T is the tag and not the value type. Seems fine then; this is just making the T, V, N explicit instead of letting the compiler infer them. I am surprised that this is required though. Would the root problem be the type-unstable nature of construct_value(basis)?
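For context, a minimal standalone sketch (the :mytag tag and the values are made up) showing that an explicit Dual{T}(value, partials) constructor keeps the tag T and only re-infers V and N:

using ForwardDiff: Dual, value, partials

d = Dual{:mytag}(1.0, 2.0)                          # tag :mytag, value 1.0, one partial
z = Complex(Dual{:mytag}(value(d), partials(d)),    # real part rebuilt with the same tag
            Dual{:mytag}(0.0, 0.0))                 # imaginary part with the same tag
real(z) isa Dual{:mytag}                            # true: the tag survived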
Would
Complex(Dual{T,V,N}(real(ψnk), real.(δψnk)),
Dual{T,V,N}(imag(ψnk), imag.(δψnk)))
solve the issue?
Also, what about here? Don't we have a similar situation?
DFTK.jl/src/workarounds/forwarddiff_rules.jl, lines 321 to 326 (at 3551ad3):
function LinearAlgebra.norm(x::SVector{S,<:Dual{Tg,T,N}}) where {S,Tg,T,N}
    x_value = ForwardDiff.value.(x)
    y = norm(x_value)
    dy = ntuple(j->real(dot(x_value, ForwardDiff.partials.(x,j))) * pinv(y), N)
    Dual{Tg}(y, dy)
end
Ah no, I mean that Dual{T} is fine. I misread it the first time. What is weird to me is why this is necessary, given that the GPU compiler isn't operating directly on this method? (Or is it?)
Looking at @code_typed might be quite insightful.
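For instance, something along these lines (the concrete Dual arguments are just placeholders) shows what the compiler infers for the explicit construction:

using InteractiveUtils   # for @code_typed
using ForwardDiff: Dual

@code_typed Complex(Dual{:t}(1.0, 2.0), Dual{:t}(3.0, 4.0))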
Sorry, I didn't see @Technici4n's last message when I wrote the above.
The GPU compiler requires this to be able to compile. I am not sure what the root cause is, but type instability is a likely candidate. Generally, the GPU compiler is rather bad at type inference.
end

# Ensure functionals from DftFunctionals are sent to the CPU, until DftFunctionals.jl is refactored
function DftFunctionals.$fun(fun::DftFunctionals.Functional, density::LibxcDensities)
This looks weird to me. Why is this needed on top of the above? Should the types not somehow depend on a GPU type here?
The tricky bit is to avoid ambiguity with the internal definitions of DftFunctionals.jl.
Adding the following to src/gpu/gpu_arrays.jl leads to ambiguity, because it does not specialize on the type of functional (:lda, :gga, or :mgga):
for fun in (:potential_terms, :kernel_terms)
    @eval function DftFunctionals.$fun(fun::DftFunctionals.Functional, ρ::AT,
                                       args...) where {AT <: AbstractGPUArray}
        # Fallback implementation for the GPU: Transfer to the CPU and run computation there
        cpuify(::Nothing) = nothing
        cpuify(x::AbstractArray) = Array(x)
        $fun(fun, Array(ρ), cpuify.(args)...)
    end
end
Either I write the above for each functional type (a lot of code duplication), or I parametrize it with a second loop over the functional types, e.g.:
for fun in (:potential_terms, :kernel_terms), ftype in (:lda, :gga, :mgga)
    @eval function DftFunctionals.$fun(fun::DftFunctionals.Functional{$(QuoteNode(ftype))},
                                       ρ::AT, args...) where {AT <: AbstractGPUArray}
        # Fallback implementation for the GPU: Transfer to the CPU and run computation there
        cpuify(::Nothing) = nothing
        cpuify(x::AbstractArray) = Array(x)
        $fun(fun, Array(ρ), cpuify.(args)...)
    end
end
I don't like either of these alternatives very much. I think the current solution carries a very clear message: anything DftFunctionals-related goes to the CPU.
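As a toy illustration of the ambiguity (all names here are made up, not DftFunctionals.jl API): one method is more specific in the functional argument, the other in the array argument, so Julia cannot pick a winner.

abstract type Functional{Family} end
struct MyLDA <: Functional{:lda} end

f(::Functional{:lda}, x::AbstractArray) = "library method, more specific in the functional"
f(::Functional,       x::Vector)        = "GPU-style fallback, more specific in the array"

# f(MyLDA(), [1.0])  # MethodError: ambiguous, since neither method dominates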
copyto!(y, _mul(p, x))
end
function Base.:*(p::AbstractFFTs.Plan, x::AbstractArray{<:Complex{<:Dual{Tg}}}) where {Tg}
function _mul(p::AbstractFFTs.Plan, x::AbstractArray{<:Complex{<:Dual{Tg}}}) where {Tg}
Again this feels strange and is surprising to me. Why did you need this?
Without this workaround, the GPU compiler throws an invalid LLVM IR error during stress calculations. I think there is confusion around which method of Base.:* to use, but I don't understand why.
@@ -1,492 +1,66 @@
# Wrappers around ForwardDiff to fix tags and reduce compilation time.
I like the general idea to make these tests more generic, but I don't like the split-up, because this way it's one more level of indirection I have to follow to see the actual test code (from the file and line information printed in the test suite output).
I recall I wrote test functions elsewhere that internally create multiple @testcase instances: you call just one function for the CPU tests and it internally creates the "templated" test cases with the CPU architecture. Would that be an option here? If not, reducing code duplication clearly wins and this is likely a good option.
Also, for the cases that do not depend on the architecture, I'd avoid the indirection altogether and just keep the test code here.
But I have to think about this a bit more, to see if I can come up with ideas to improve the cut between the multiple files.
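A rough sketch of that pattern (function and testset names are hypothetical): a single entry point that internally creates the architecture-specific test sets.

using Test

function run_forwarddiff_tests(architecture; label=string(architecture))
    @testset "stresses ($label)" begin
        # ... actual test body, parametrized by `architecture` ...
        @test true
    end
    @testset "response ($label)" begin
        @test true
    end
end

run_forwarddiff_tests(Array)       # CPU run
# run_forwarddiff_tests(CuArray)   # GPU run, when a GPU is available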
This PR enables ForwardDiff calculations (stress and response) on the GPU. Main changes are:
- adaptations of the ForwardDiff workarounds (src/workarounds/forwarddiff_rules.jl)
- computations involving the DftFunctionals.jl package are transferred to the CPU, until that package is refactored

With this PR, all ForwardDiff workflows currently tested on the CPU successfully run on both NVIDIA and AMD GPUs.

Future improvements will come with:
- PlaneWaveBasis instantiation
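A sketch of the kind of workflow this PR enables. The silicon setup and keyword values are placeholders taken from DFTK's documented examples, and the exact API may differ between versions:

using DFTK
using CUDA

a = 10.26  # silicon lattice constant in Bohr
lattice = a / 2 * [[0 1 1.]; [1 0 1.]; [1 1 0.]]
Si = ElementPsp(:Si; psp=load_psp("hgh/lda/si-q4"))
atoms = [Si, Si]
positions = [ones(3)/8, -ones(3)/8]

model = model_DFT(lattice, atoms, positions; functionals=LDA())
basis = PlaneWaveBasis(model; Ecut=20, kgrid=(2, 2, 2),
                       architecture=DFTK.GPU(CuArray))
scfres = self_consistent_field(basis; tol=1e-8)

# ForwardDiff-based quantity that this PR makes runnable on the GPU:
stresses = compute_stresses_cart(scfres)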