Initial work on CUDA-compat #25
Conversation
I think the CUDA extension now works properly; all current CUDA tests pass. The following code runs properly:

using CUDA
using LinearAlgebra
using Distributions, Random
using Bijectors
using Flux   # for f32 and gpu
using NormalizingFlows

rng = CUDA.default_rng()
T = Float32
q0_g = MvNormal(CUDA.zeros(T, 2), I)
CUDA.functional()
ts = reduce(∘, [f32(Bijectors.PlanarLayer(2)) for _ in 1:2])  # as defined in the later snippet
ts_g = gpu(ts)
flow_g = transformed(q0_g, ts_g)
x = rand(rng, q0_g) # good

However, there are still issues to fix: sampling multiple samples at once, and sampling from the flow.
xs = rand(rng, q0_g, 10) # ambiguous error message:
ERROR: MethodError: rand(::CUDA.RNG, ::MvNormal{Float32, PDMats.ScalMat{Float32}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ::Int64) is ambiguous.
Candidates:
rand(rng::Random.AbstractRNG, s::Sampleable{Multivariate, Continuous}, n::Int64)
@ Distributions ~/.julia/packages/Distributions/Ufrz2/src/multivariates.jl:23
rand(rng::Random.AbstractRNG, s::Sampleable{Multivariate}, n::Int64)
@ Distributions ~/.julia/packages/Distributions/Ufrz2/src/multivariates.jl:21
rand(rng::CUDA.RNG, s::Sampleable{<:ArrayLikeVariate, Continuous}, n::Int64)
@ NormalizingFlowsCUDAExt ~/Research/Turing/NormalizingFlows.jl/ext/NormalizingFlowsCUDAExt.jl:16
Possible fix, define
rand(::CUDA.RNG, ::Sampleable{Multivariate, Continuous}, ::Int64)
Stacktrace:
[1] top-level scope
@ ~/Research/Turing/NormalizingFlows.jl/example/test.jl:42
y = rand(rng, flow_g) # ambiguous error message:
ERROR: MethodError: rand(::CUDA.RNG, ::MultivariateTransformed{MvNormal{Float32, PDMats.ScalMat{Float32}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ComposedFunction{PlanarLayer{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, PlanarLayer{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}) is ambiguous.
Candidates:
rand(rng::Random.AbstractRNG, td::MultivariateTransformed)
@ Bijectors ~/.julia/packages/Bijectors/cvMxj/src/transformed_distribution.jl:160
rand(rng::CUDA.RNG, s::Sampleable{<:ArrayLikeVariate, Continuous})
@ NormalizingFlowsCUDAExt ~/Research/Turing/NormalizingFlows.jl/ext/NormalizingFlowsCUDAExt.jl:7
Possible fix, define
rand(::CUDA.RNG, ::MultivariateTransformed)
Stacktrace:
[1] top-level scope
@ ~/Research/Turing/NormalizingFlows.jl/example/test.jl:40

This is partially because we are overloading methods and types that are not owned by this package.
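For concreteness, here is a minimal sketch of the disambiguating methods the two MethodErrors ask for, assuming they would live inside NormalizingFlowsCUDAExt (which already has CUDA, Distributions, and Bijectors in scope). Type and field names (`Bijectors.MultivariateTransformed`, `td.dist`, `td.transform`) are taken from the error output and Bijectors' transformed-distribution interface; this is an illustration of the suggested fix, not the change adopted in this PR.

```julia
# Sketch only: the two method intersections suggested by the MethodErrors above.

# Disambiguate against Distributions' generic multivariate methods.
function Distributions.rand(
    rng::CUDA.RNG,
    s::Distributions.Sampleable{Distributions.Multivariate,Distributions.Continuous},
    n::Int,
)
    # Allocate the output on the GPU and fill it in place with the CUDA RNG.
    return Distributions.rand!(
        rng, Distributions.sampler(s), CuArray{float(eltype(s))}(undef, length(s), n)
    )
end

# Disambiguate against Bijectors' rand(::AbstractRNG, ::MultivariateTransformed):
# sample the reference distribution on the GPU, then push it through the bijector.
function Distributions.rand(rng::CUDA.RNG, td::Bijectors.MultivariateTransformed)
    return td.transform(rand(rng, td.dist))
end
```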
I don't have an immediate solution other than the suggested fixes.
Yeah, I agree. For a temporary solution, I'm thinking of adding an additional argument for
ext/NormalizingFlowsCUDAExt.jl (outdated)
function Distributions._rand!(rng::CUDA.RNG, d::Distributions.MvNormal, x::CuVecOrMat)
    # Replaced usage of scalar indexing.
    CUDA.randn!(rng, x)
@zuhengxu do you know why this change of yours was necessary? I thought Random.randn!(rng, x) should just dispatch to CUDA.randn!(rng, x)?
ahh, you are right---this is not necessary. I think I just made the change to ensure it's actually calling the cuda sampling. I can change it back.
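For reference, the reverted overload would look roughly like the sketch below. It mirrors what Distributions itself does for MvNormal and relies on Random.randn! dispatching to the CUDA kernel when given a CUDA.RNG and a CuArray; the use of PDMats.unwhiten! on the GPU-backed covariance types here (e.g. ScalMat) is an assumption, not code from this PR.

```julia
using CUDA, Distributions, PDMats, Random

# Sketch: no explicit CUDA.randn! needed; Random.randn! dispatches on (CUDA.RNG, CuArray).
function Distributions._rand!(rng::CUDA.RNG, d::Distributions.MvNormal, x::CuVecOrMat)
    Random.randn!(rng, x)       # standard-normal fill, runs on the GPU
    PDMats.unwhiten!(d.Σ, x)    # scale by the covariance factor, as Distributions does
    x .+= d.μ                   # shift by the mean
    return x
end
```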
ext/NormalizingFlowsCUDAExt.jl
Outdated
function Distributions.rand( | ||
rng::CUDA.RNG, | ||
s::Distributions.Sampleable{<:Distributions.ArrayLikeVariate,Distributions.Continuous}, | ||
n::Int, | ||
) | ||
return @inbounds Distributions.rand!( | ||
rng, Distributions.sampler(s), CuArray{float(eltype(s))}(undef, length(s), n) | ||
) | ||
end |
Usage of length here will cause some issues, e.g. what if s is wrapping a matrix distribution? Maybe (undef, size(s)..., n) will do? But I don't quite recall what the correct size is here; it should be somewhere in the Distributions.jl docs.
example/Project.toml (outdated)
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Revise = "295af30f-e4ad-537b-8983-00126c2a3abe"
Zygote = "e88e6eb3-aa80-5325-afca-941959d7151f"
cuDNN = "02a925ec-e4fe-4b08-9a7e-0d78e3d38ccd"
Is this needed now (after you removed the test-file you were using)?
cuDNN = "02a925ec-e4fe-4b08-9a7e-0d78e3d38ccd"
This dependency is needed if we want some of the Flux.jl chains to run properly on GPU. But you are right, it's not used for the current examples, which all run on the CPU. I'll remove it later.
Honestly, IMO, the best solution right now is just to add our own …
If we want to properly support all of this, we'll have to go down the path of specializing the methods further, i.e. not do a …
For now, just make a …
How does that sound?
Yeah, after thinking about it, I agree that this is probably the best way to go at this point. Working on it now!
I have adapted the … :

using CUDA
using LinearAlgebra
using Distributions, Random
using Bijectors
using Flux
import NormalizingFlows as NF

rng = CUDA.default_rng()
T = Float32
q0_g = MvNormal(CUDA.zeros(T, 2), I)
CUDA.functional()
ts = reduce(∘, [f32(Bijectors.PlanarLayer(2)) for _ in 1:2])
ts_g = gpu(ts)
flow_g = transformed(q0_g, ts_g)

@torfjelde @sunxd3 Let me know if this attempt looks good to you. If so, I'll update the docs.
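For illustration only: assuming the internal `_device_specific_rand` entry point discussed further down, with single-sample and N-sample signatures, sampling from this flow on the GPU would presumably look like the following (signatures not confirmed against the final code in this PR):

```julia
# Assumed signatures; hypothetical usage, not taken from the PR itself.
x  = NF._device_specific_rand(rng, flow_g)        # one sample, stays on the GPU
xs = NF._device_specific_rand(rng, flow_g, 10)    # ten iid samples as a CuMatrix
```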
@torfjelde @zuhengxu, can you try to finish this PR if nothing major stands in the way?
Some benchmarks: https://gist.github.com/sunxd3/f680959b3f16e61521f1396db5c509bc
I'm very sorry - I don't think I have the knowledge to review this. Happy to try to look at specific questions if you have any though.
No worries, @penelopeysm. It might take you a while to become familiar with TuringLang libraries. In the meantime, please feel free to comment from the RSE perspective!
Sorry for being late to the party; in my experience, making
@testset "rand with CUDA" begin
    # Bijectors versions use dot for broadcasting, which causes issues with CUDA.
What's the status of GPU compatibility in Bijectors? Is there a list of bijectors that might cause issues with CUDA?
I don't think anyone is very certain right now -- we need to do a sweep to tell.
generator is device specific (e.g. `CUDA.RNG`).
"""
function _device_specific_rand end
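Purely as an illustration of the dispatch pattern this stub enables (not the PR's actual methods): the main package can ship a plain-RNG fallback, while the CUDA extension adds a `CUDA.RNG` method that allocates its output on the GPU. The two methods below are schematic and would live in different files; the allocation mirrors the extension code quoted earlier in this thread.

```julia
# Hypothetical sketch, not the code in this PR.

# In the main package: CPU fallback simply defers to the usual rand.
function _device_specific_rand(rng::Random.AbstractRNG, s, n::Int)
    return rand(rng, s, n)
end

# In the CUDA extension: same function, but the CUDA.RNG method samples into a CuArray.
function NormalizingFlows._device_specific_rand(
    rng::CUDA.RNG,
    s::Distributions.Sampleable{<:Distributions.ArrayLikeVariate,Distributions.Continuous},
    n::Int,
)
    xs = CuArray{float(eltype(s))}(undef, size(s)..., n)
    return Distributions.rand!(rng, Distributions.sampler(s), xs)
end
```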
The name _device_specific_rand is a bit of a mouthful for users; I think it's totally fine for internal usage. For the user API, would it be better to wrap it into iid_sample_reference(rng, dist, N) and iid_sample_flow(rng, flow, N)? Then we can dispatch these two functions on the rng.
Doing this could also be beneficial if we want to relax the types of dist and flow (e.g., to adapt to Lux.jl).
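A rough sketch of the proposed wrappers, purely illustrative: the names and signatures follow the suggestion above, and `_device_specific_rand(rng, dist_or_flow, N)` is assumed as the underlying entry point.

```julia
# Hypothetical user-facing API, dispatching on the RNG type as suggested above.
iid_sample_reference(rng::Random.AbstractRNG, dist, N::Int) =
    NormalizingFlows._device_specific_rand(rng, dist, N)

iid_sample_flow(rng::Random.AbstractRNG, flow, N::Int) =
    NormalizingFlows._device_specific_rand(rng, flow, N)

# A user (or another backend) could then add device-specific methods without
# touching the package, e.g.
# iid_sample_reference(rng::CUDA.RNG, dist, N::Int) = <custom GPU sampler>
```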
Let's take a look at what @Red-Portal suggested above. One might be able to use rand for GPUs, too: there are many improvements in the JuliaGPU ecosystem. rand usually assumes iid samples and is a much nicer API.
Since we're only using Gaussians, randn should be more useful, to be precise.
"One might be able to use rand for GPUs, too"

As demonstrated in my comments below, I don't think we can simply use rand with the CUDA rng to deal with general reference distributions. If all we need is a standard Gaussian reference distribution, that's fine, but I'm not a huge fan of limiting what reference distributions users can use.
The reason I suggested the additional iid_sample_reference and iid_sample_flow is that we would have an API that lets users write their own sampler on whatever device they want. What are your thoughts?
Thank you for all the comments and feedback! Here are some of my thoughts.
I don't think it works, or maybe I'm not doing it the right way. The following code doesn't work: …
This will return a CPU array; that's why we need to have this.
This part I agree with. @sunxd3, maybe we can leverage the existing GPU compatibility of …
Yeah, avoiding Distributions.jl like the plague is necessary.
Yep, see NormalizingFlows.jl/ext/NormalizingFlowsCUDAExt.jl, lines 43 to 48 at commit 3f07fe5.
The benchmark (https://gist.github.com/sunxd3/f680959b3f16e61521f1396db5c509bc) shows that sampling from MvNormal scales well, but sampling from flows does not yet.
Yeah, overall the code under the hood looks great to me (thanks again @sunxd3!). My only concern is that the sampling function that users get to call is … Aside from this, since the sampling function and logpdf function work with CUDA, let's add some tests of the whole pipeline to see if planar flow training and evaluation on GPU all work properly. For example, run this planar flow test with GPU.
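A rough sketch of what such an end-to-end GPU test could look like, purely illustrative: it reuses the GPU flow construction shown earlier in this thread, the `_device_specific_rand` signatures are assumed, and the training step is omitted here.

```julia
using CUDA, Flux, Bijectors, Distributions, LinearAlgebra, Test
import NormalizingFlows as NF

@testset "planar flow on GPU (sketch)" begin
    rng = CUDA.default_rng()
    T = Float32
    q0_g = MvNormal(CUDA.zeros(T, 2), I)
    ts_g = gpu(reduce(∘, [f32(Bijectors.PlanarLayer(2)) for _ in 1:2]))
    flow_g = transformed(q0_g, ts_g)

    x = NF._device_specific_rand(rng, flow_g)        # assumed signature
    xs = NF._device_specific_rand(rng, flow_g, 16)   # assumed signature

    @test x isa CuArray{T}
    @test size(xs) == (2, 16)
    # The thread above reports that logpdf also works with CUDA arrays.
    @test logpdf(flow_g, x) isa Real
end
```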
I agree; my thought is that for now we can keep … Re: the flow test, happy to add it later.
My thought is that we don't have to generalize to allow any …
For now, …
Let's introduce …
It seems overloading an external package's methods in an extension doesn't work (which is probably for the better), so atm the CUDA tests are failing.
But if we move the overloads into the main package, they run, so we should probably do that from now on.