
Conversation

torfjelde
Member

It seems that overloading an external package's methods in an extension doesn't work (which is probably for the better), so at the moment the CUDA tests are failing.

But if we move the overloads into the main package, they run. So we should probably do that from now on.

@zuhengxu
Member

zuhengxu commented Aug 15, 2023

I think the CUDA extension now works properly; the current CUDA tests all pass. The following code runs properly:

using CUDA
using LinearAlgebra
using Distributions, Random
using Bijectors
using Flux
using NormalizingFlows

rng = CUDA.default_rng()
T = Float32
q0_g = MvNormal(CUDA.zeros(T, 2), I)

CUDA.functional()
ts = reduce(∘, [f32(Bijectors.PlanarLayer(2)) for _ in 1:2])  # two planar layers composed (f32/gpu come from Flux)
ts_g = gpu(ts)
flow_g = transformed(q0_g, ts_g)

x = rand(rng, q0_g) # good 

However, there are still issues to fix: drawing multiple samples at once, and sampling from a Bijectors.TransformedDistribution. Minimal examples are as follows:

  • drawing multiple samples in one batch:
xs = rand(rng, q0_g, 10) # ambiguous 

Error message:

ERROR: MethodError: rand(::CUDA.RNG, ::MvNormal{Float32, PDMats.ScalMat{Float32}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ::Int64) is ambiguous.

Candidates:
  rand(rng::Random.AbstractRNG, s::Sampleable{Multivariate, Continuous}, n::Int64)
    @ Distributions ~/.julia/packages/Distributions/Ufrz2/src/multivariates.jl:23
  rand(rng::Random.AbstractRNG, s::Sampleable{Multivariate}, n::Int64)
    @ Distributions ~/.julia/packages/Distributions/Ufrz2/src/multivariates.jl:21
  rand(rng::CUDA.RNG, s::Sampleable{<:ArrayLikeVariate, Continuous}, n::Int64)
    @ NormalizingFlowsCUDAExt ~/Research/Turing/NormalizingFlows.jl/ext/NormalizingFlowsCUDAExt.jl:16

Possible fix, define
  rand(::CUDA.RNG, ::Sampleable{Multivariate, Continuous}, ::Int64)

Stacktrace:
 [1] top-level scope
   @ ~/Research/Turing/NormalizingFlows.jl/example/test.jl:42
  • sampling from a Bijectors.TransformedDistribution:
y = rand(rng, flow_g) # ambiguous

Error message:

ERROR: MethodError: rand(::CUDA.RNG, ::MultivariateTransformed{MvNormal{Float32, PDMats.ScalMat{Float32}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ComposedFunction{PlanarLayer{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, PlanarLayer{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}) is ambiguous.

Candidates:
  rand(rng::Random.AbstractRNG, td::MultivariateTransformed)
    @ Bijectors ~/.julia/packages/Bijectors/cvMxj/src/transformed_distribution.jl:160
  rand(rng::CUDA.RNG, s::Sampleable{<:ArrayLikeVariate, Continuous})
    @ NormalizingFlowsCUDAExt ~/Research/Turing/NormalizingFlows.jl/ext/NormalizingFlowsCUDAExt.jl:7

Possible fix, define
  rand(::CUDA.RNG, ::MultivariateTransformed)

Stacktrace:
 [1] top-level scope
   @ ~/Research/Turing/NormalizingFlows.jl/example/test.jl:40

This is partly because we are overloading methods on types that are not owned by this package.
Any thoughts on how to address this, @torfjelde @sunxd3?

@sunxd3
Member

sunxd3 commented Aug 16, 2023

I don't have an immediate solution other than the suggested fixes.
It is indeed a bit annoying; maybe we shouldn't dispatch on rng?

@zuhengxu
Member

zuhengxu commented Aug 16, 2023

It is indeed a bit annoying; maybe we shouldn't dispatch on rng?

Yeah, I agree. As a temporary solution, I'm thinking of adding an additional argument to Distributions.rand, something like device, to indicate whether to sample on CPU or GPU. But as a long-term fix, I'm now leaning towards your previous attempt, although this will require resolving some compatibility issues with Bijectors.
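Roughly something like this, just to make the idea concrete (hypothetical name and signature, purely illustrative and not part of this PR):

# Hypothetical sketch of the extra `device` argument idea; the GPU branch would still
# need a CUDA-aware rng (and a CuArray-specialized _rand!, like the one below) to avoid
# scalar indexing.
using CUDA, Distributions, Random

function rand_on_device(rng::Random.AbstractRNG, d::Distributions.MvNormal, n::Int; device=:cpu)
    T = float(eltype(d))
    x = device === :gpu ? CUDA.zeros(T, length(d), n) : zeros(T, length(d), n)
    return Distributions._rand!(rng, d, x)  # fill the preallocated output in place
end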


function Distributions._rand!(rng::CUDA.RNG, d::Distributions.MvNormal, x::CuVecOrMat)
    # Replaced usage of scalar indexing.
    CUDA.randn!(rng, x)
Member Author

@zuhengxu do you know why this change of yours was necessary? I thought Random.randn!(rng, x) should just dispatch to CUDA.randn!(rng, x)?

Member

Ahh, you are right---this is not necessary. I think I just made the change to make sure it was actually calling the CUDA sampler. I can change it back.
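For reference, this dispatch can be checked directly with something like the following (assuming a functional GPU):

using CUDA, Random

rng = CUDA.default_rng()
x = CUDA.zeros(Float32, 4)
Random.randn!(rng, x)  # dispatches to CUDA's randn! for CUDA.RNG + CuArray, filling x on the GPU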

Comment on lines 16 to 24
function Distributions.rand(
    rng::CUDA.RNG,
    s::Distributions.Sampleable{<:Distributions.ArrayLikeVariate,Distributions.Continuous},
    n::Int,
)
    return @inbounds Distributions.rand!(
        rng, Distributions.sampler(s), CuArray{float(eltype(s))}(undef, length(s), n)
    )
end
Member Author

Usage of length here will cause some issues, e.g. what if s is wrapping a matrix distribution?

Maybe (undef, size(s)..., n) will do? But I don't quite recall what the correct size is here; it should be somewhere in the Distributions.jl docs.
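Something along these lines, perhaps (untested; assumes rand! accepts a (size(s)..., n)-shaped array for ArrayLikeVariate samplers, which is worth double-checking against the Distributions.jl docs):

function Distributions.rand(
    rng::CUDA.RNG,
    s::Distributions.Sampleable{<:Distributions.ArrayLikeVariate,Distributions.Continuous},
    n::Int,
)
    # Use size(s) instead of length(s) so that e.g. matrix-variate samplers get a
    # (size(s)..., n)-shaped output rather than a flattened length(s) × n one.
    return @inbounds Distributions.rand!(
        rng, Distributions.sampler(s), CuArray{float(eltype(s))}(undef, size(s)..., n)
    )
end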

Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Revise = "295af30f-e4ad-537b-8983-00126c2a3abe"
Zygote = "e88e6eb3-aa80-5325-afca-941959d7151f"
cuDNN = "02a925ec-e4fe-4b08-9a7e-0d78e3d38ccd"
Member Author

Is this needed now (after you removed the test-file you were using)?

Suggested change
cuDNN = "02a925ec-e4fe-4b08-9a7e-0d78e3d38ccd"

Member

This dependency is needed if we want some of the Flux.jl chains to run properly on GPU. But you are right, it's not used in the current examples---they are all running on CPU. I'll remove it later.

@torfjelde
Member Author

Honestly, IMO, the best solution right now is just to add our own rand to avoid ambiguity errors.

If we want to properly support all of this, we'll have to go down the path of specializing the methods further, i.e. not using a Union as we've done now, which will take time and effort.

For now, just make a NormalizingFlows.rand_device or something that calls rand by default, but which we can then overload to our liking without running into ambiguity errors.

How does that sound?
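Roughly, the indirection could look like this (a minimal sketch, using the name suggested above):

# Default: simply forward to rand. A CUDA extension can then overload this on
# CUDA.RNG without touching any methods owned by Distributions or Bijectors.
using Random

rand_device(rng::Random.AbstractRNG, args...) = rand(rng, args...)

# e.g. in ext/NormalizingFlowsCUDAExt.jl (hypothetical method):
# rand_device(rng::CUDA.RNG, s, n::Int) = ...  # GPU-specific sampling returning a CuArray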

@zuhengxu
Member

For now, just make a NormalizingFlows.rand_device or something that calls rand by default, but which we can then overload to our liking without running into ambiguity errors.

Yeah, after thinking about it, I agree that this is probably the best way to go at this point. Working on it now!

@zuhengxu
Member

zuhengxu commented Aug 22, 2023

I have adopted the NF.rand_device() approach. I think we now have a workaround. The following code runs properly:

using CUDA
using LinearAlgebra
using Distributions, Random
using Bijectors
using Flux
import NormalizingFlows as NF

rng = CUDA.default_rng()
T = Float32
q0_g = MvNormal(CUDA.zeros(T, 2), I)

CUDA.functional()
ts = reduce(∘, [f32(Bijectors.PlanarLayer(2)) for _ in 1:2])
ts_g = gpu(ts)
flow_g = transformed(q0_g, ts_g)

@torfjelde @sunxd3 Let me know if this attempt looks good to you. If so, I'll update the docs.

@yebai
Member

yebai commented Mar 5, 2025

@torfjelde @zuhengxu, can you try to finish this PR if nothing major stands in the way?

@sunxd3
Member

sunxd3 commented Mar 14, 2025

@yebai yebai requested review from penelopeysm and zuhengxu March 14, 2025 14:08
@penelopeysm
Member

I'm very sorry - I don't think I have the knowledge to review this. Happy to try to look at specific questions if you have any though.

@yebai
Member

yebai commented Mar 14, 2025

I'm very sorry - I don't think I have the knowledge to review this.

No worries, @penelopeysm. It might take you a while to become familiar with TuringLang libraries. In the meantime, please feel free to comment from the RSE perspective!

@Red-Portal
Member

Red-Portal commented Mar 15, 2025

Sorry for being late to the party; in my experience making AdvancedVI GPU-compatible, it was sufficient to simply swap the RNGs with the CUDA ones to make things work seamlessly. When this actually makes it into AdvancedVI, I will probably add a simple indirection for selecting which RNG gets supplied to rand. Here, on the other hand, was there any reason for defining CUDA-specific rand functions?


@testset "rand with CUDA" begin

# Bijectors versions use dot for broadcasting, which causes issues with CUDA.
Member

What's the status of GPU compatibility in Bijectors? Is there a list of bijectors that might cause issues with CUDA?

Member

I don't think anyone is very certain right now -- we need to do a sweep to tell.

generator is device specific (e.g. `CUDA.RNG`).
"""
function _device_specific_rand end

Member

The name _device_specific_rand is a bit of a mouthful for users; I think it's totally fine for internal usage. For the user API, would it be better to wrap it into iid_sample_reference(rng, dist, N) and iid_sample_flow(rng, flow, N)? And we can dispatch these two functions on the rng.

Doing this could also be beneficial if we want to relax the types of dist and flow (e.g., to adapt to Lux.jl).
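A rough sketch of what that could look like (names are only suggestions; the internal function is assumed to take (rng, dist, N)):

# Hypothetical user-facing wrappers; they forward to the internal function, which
# dispatches on the rng type, so a CUDA.RNG yields GPU samples and any other rng
# stays on the CPU.
iid_sample_reference(rng::Random.AbstractRNG, dist, N::Int) = _device_specific_rand(rng, dist, N)
iid_sample_flow(rng::Random.AbstractRNG, flow, N::Int) = _device_specific_rand(rng, flow, N)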

Member

@yebai yebai Mar 16, 2025

Let's take a look at what @Red-Portal suggested above. One might be able to use rand for GPUs, too: there are many improvements in the JuliaGPU ecosystem. rand usually assumes iid samples and is a much nicer API.

Member

Since we're only using Gaussians, randn should be more useful to be precise.

Member

One might be able to use rand for GPUs, too

As demonstrated in my comments below, I don't think we can simply use rand with the CUDA rng to deal with general reference distributions. If all we need is a standard Gaussian reference distribution, that's fine. But I'm not a huge fan of limiting what reference distributions users can use.

The reason I suggested the additional iid_sample_reference and iid_sample_flow is that we'd have an API that lets users write their own samplers on whatever device they want. What are your thoughts?

@zuhengxu
Member

zuhengxu commented Mar 17, 2025

Thank you for all the comments and feedback! Here are some of my thoughts.

Sorry for being late to the party; in my experience making AdvancedVI GPU-compatible, it was sufficient to simply swap the RNGs with the CUDA ones to make things work seamlessly. When this actually makes it into AdvancedVI, I will probably add a simple indirection for selecting which RNG gets supplied to rand. Here, on the other hand, was there any reason for defining CUDA-specific rand functions?

I don't think it works, or maybe I'm not doing it the right way. The following code doesn't work:

using Random, Distributions, LinearAlgebra
using CUDA

rng_g = CUDA.default_rng()
T = Float32
q0_g = MvNormal(CuArray(zeros(T, 2)), CuArray(ones(T, 2)))

rand(rng_g, q0_g, 10)

This returns a CPU array:

julia> rand(rng_g, q0_g, 10)
2×10 Matrix{Float32}:
 -0.322776   0.129745  1.08983   1.06144   -1.09644  -0.206741  0.273979   -0.518334  -1.35439    -1.79912
  0.259409  -0.583322  0.601871  0.571779   1.41544  -1.19601   0.0845693   0.766666  -0.0962037   0.0355074

That's why we need this _device_specific_rand function that can dispatch on the rng.
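For reference, the pattern is roughly the following (a sketch only, not necessarily the exact methods in this PR):

# CPU fallback: defer to the usual rand.
_device_specific_rand(rng::Random.AbstractRNG, d, n::Int) = rand(rng, d, n)

# In the CUDA extension: allocate the output on the GPU and fill it in place, so the
# result comes back as a CuArray (relying on a CuArray-specialized _rand!, as below).
function _device_specific_rand(rng::CUDA.RNG, d::Distributions.MvNormal, n::Int)
    x = CuArray{float(eltype(d))}(undef, length(d), n)
    return Distributions._rand!(rng, d, x)
end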

Since we're only using Gaussians, randn should be more useful to be precise.

I agree with this part. @sunxd3, maybe we can leverage the existing GPU compatibility of randn, e.g.,

julia> rng_g = CUDA.default_rng()
CUDA.RNG(0xd97bf5d1, 0x0000004a)

julia> T = Float64
Float64

julia> randn(rng_g, T, 2, 10)
2×10 CuArray{Float64, 2, CUDA.DeviceMemory}:
 -1.26503  -0.152555   0.921247  -1.05752  0.581864  -2.30691    0.26545   -1.1928     0.061966  -1.42031
 -1.81811  -0.393087  -1.85974   -1.39682  0.349741  -0.624886  -0.191652  -0.347422  -1.4505     1.74004

@Red-Portal
Member

Yeah, avoiding Distributions.jl like the plague is necessary.

@sunxd3
Member

sunxd3 commented Mar 17, 2025

yep, randn is used at

function Distributions._rand!(rng::CUDA.RNG, d::Distributions.MvNormal, x::CuVecOrMat)
    Random.randn!(rng, x)
    Distributions.unwhiten!(d.Σ, x)
    x .+= d.μ
    return x
end

The benchmark (https://gist.github.com/sunxd3/f680959b3f16e61521f1396db5c509bc) shows that sampling from MvNormal scales well, but sampling from flows does not yet.

@zuhengxu
Member

yep, randn is used at

function Distributions._rand!(rng::CUDA.RNG, d::Distributions.MvNormal, x::CuVecOrMat)
    Random.randn!(rng, x)
    Distributions.unwhiten!(d.Σ, x)
    x .+= d.μ
    return x
end

The benchmark (https://gist.github.com/sunxd3/f680959b3f16e61521f1396db5c509bc) shows that sampling from MvNormal scales well, but sampling from flows does not yet.

Yeah, overall the code under the hood looks great to me (thanks again @sunxd3!). My only concern is that the sampling function users get to call is _device_specific_rand, which is not very pleasing. I think wrapping this in a more intuitive name and dispatching on the rng type would be nicer.

Aside from this, since the sampling and logpdf functions work with CUDA, let's add a test of the whole pipeline to see if planar flow training and evaluation on GPU all work properly. For example, run this planar flow test on GPU.

@sunxd3
Member

sunxd3 commented Mar 18, 2025

My only concern is that the sampling function users get to call is _device_specific_rand, which is not very pleasing.

I agree; my thought is that for now we can keep _device_specific_rand strictly internal, and we can expose it when we have better GPU support.

Re: flow test, happy to add it later.

@Red-Portal
Member

My thought is that we don't have to generalize to allow any Distributions.jl distribution and can just use randn directly, no? I am aware that heavier-tailed base distributions have been proposed before; is that what we are trying to explicitly support?

@zuhengxu
Member

My thought is that we don't have to generalize to allow any Distributions.jl distribution and can just use randn directly, no? I am aware that heavier-tailed base distributions have been proposed before; is that what we are trying to explicitly support?

For now, randn is probably sufficient for most things we do. However, my taste is that an NF library should be as simple and flexible as just specifying the base distribution q0, the bijection B and its inverse (and maybe the loss function), and the rest should all be handled by the library. The fact that we are restricted to a Gaussian base distribution is not ideal.

@yebai
Member

yebai commented Mar 20, 2025

Let's introduce randn in a separate PR and merge this PR so we can move on.

@yebai yebai merged commit 34ed26e into main Mar 20, 2025
4 checks passed
@yebai yebai deleted the torfjelde/cuda branch March 20, 2025 21:18
@zuhengxu zuhengxu mentioned this pull request Jun 20, 2025