
Conversation

torfjelde
Member

It seems that overloading an external package's methods in an extension doesn't work (which is probably for the better), so at the moment the CUDA tests are failing.

But if we move the overloads into the main package, they run. So we should probably do that from now on.

@zuhengxu
Member

zuhengxu commented Aug 15, 2023

I think the CUDA extension now works properly; the current CUDA tests all pass. The following code runs properly:

using CUDA
using LinearAlgebra
using Distributions, Random
using Bijectors
using Flux
using NormalizingFlows

rng = CUDA.default_rng()
T = Float32
q0_g = MvNormal(CUDA.zeros(T, 2), I)

CUDA.functional()
ts = reduce(∘, [f32(Bijectors.PlanarLayer(2)) for _ in 1:2])  # two planar layers composed (f32/gpu come from Flux)
ts_g = gpu(ts)
flow_g = transformed(q0_g, ts_g)

x = rand(rng, q0_g) # good 

However, there are still issues to fix: drawing multiple samples at once, and sampling from a Bijectors.TransformedDistribution. Minimal examples are as follows:

  • drawing multiple samples in one batch:
xs = rand(rng, q0_g, 10) # ambiguous 

Error message:

ERROR: MethodError: rand(::CUDA.RNG, ::MvNormal{Float32, PDMats.ScalMat{Float32}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ::Int64) is ambiguous.

Candidates:
  rand(rng::Random.AbstractRNG, s::Sampleable{Multivariate, Continuous}, n::Int64)
    @ Distributions ~/.julia/packages/Distributions/Ufrz2/src/multivariates.jl:23
  rand(rng::Random.AbstractRNG, s::Sampleable{Multivariate}, n::Int64)
    @ Distributions ~/.julia/packages/Distributions/Ufrz2/src/multivariates.jl:21
  rand(rng::CUDA.RNG, s::Sampleable{<:ArrayLikeVariate, Continuous}, n::Int64)
    @ NormalizingFlowsCUDAExt ~/Research/Turing/NormalizingFlows.jl/ext/NormalizingFlowsCUDAExt.jl:16

Possible fix, define
  rand(::CUDA.RNG, ::Sampleable{Multivariate, Continuous}, ::Int64)

Stacktrace:
 [1] top-level scope
   @ ~/Research/Turing/NormalizingFlows.jl/example/test.jl:42
  • sampling from a Bijectors.TransformedDistribution:
y = rand(rng, flow_g) # ambiguous

Error message:

ERROR: MethodError: rand(::CUDA.RNG, ::MultivariateTransformed{MvNormal{Float32, PDMats.ScalMat{Float32}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ComposedFunction{PlanarLayer{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, PlanarLayer{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}) is ambiguous.

Candidates:
  rand(rng::Random.AbstractRNG, td::MultivariateTransformed)
    @ Bijectors ~/.julia/packages/Bijectors/cvMxj/src/transformed_distribution.jl:160
  rand(rng::CUDA.RNG, s::Sampleable{<:ArrayLikeVariate, Continuous})
    @ NormalizingFlowsCUDAExt ~/Research/Turing/NormalizingFlows.jl/ext/NormalizingFlowsCUDAExt.jl:7

Possible fix, define
  rand(::CUDA.RNG, ::MultivariateTransformed)

Stacktrace:
 [1] top-level scope
   @ ~/Research/Turing/NormalizingFlows.jl/example/test.jl:40

This is partly because we are overloading methods on types that are not owned by this package.
Any thoughts on how to address this, @torfjelde @sunxd3?

@sunxd3
Member

sunxd3 commented Aug 16, 2023

I don't have an immediate solution other than the suggested fixes.
It is indeed a bit annoying; maybe we shouldn't dispatch on rng?

@zuhengxu
Member

zuhengxu commented Aug 16, 2023

It is indeed a bit annoying; maybe we shouldn't dispatch on rng?

Yeah, I agree. As a temporary solution, I'm thinking of adding an additional argument to Distributions.rand, something like device, to indicate whether to sample on CPU or GPU. But as a long-term fix, I'm now leaning towards your previous attempt, although this will require resolving some compatibility issues with Bijectors.
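Roughly something like this, just to make the idea concrete (hypothetical name and signature, purely illustrative and not part of this PR):

# Hypothetical sketch of the extra `device` argument idea; the GPU branch would still
# need a CUDA-aware rng (and a CuArray-specialized _rand!, like the one below) to avoid
# scalar indexing.
using CUDA, Distributions, Random

function rand_on_device(rng::Random.AbstractRNG, d::Distributions.MvNormal, n::Int; device=:cpu)
    T = float(eltype(d))
    x = device === :gpu ? CUDA.zeros(T, length(d), n) : zeros(T, length(d), n)
    return Distributions._rand!(rng, d, x)  # fill the preallocated output in place
end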


function Distributions._rand!(rng::CUDA.RNG, d::Distributions.MvNormal, x::CuVecOrMat)
    # Replaced usage of scalar indexing.
    CUDA.randn!(rng, x)
Member Author

@zuhengxu do you know why this change of yours was necessary? I thought Random.randn!(rng, x) should just dispatch to CUDA.randn!(rng, x)?

Member

Ahh, you are right---this is not necessary. I think I just made the change to make sure it was actually calling the CUDA sampler. I can change it back.
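For reference, this dispatch can be checked directly with something like the following (assuming a functional GPU):

using CUDA, Random

rng = CUDA.default_rng()
x = CUDA.zeros(Float32, 4)
Random.randn!(rng, x)  # dispatches to CUDA's randn! for CUDA.RNG + CuArray, filling x on the GPU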

Comment on lines 16 to 24
function Distributions.rand(
    rng::CUDA.RNG,
    s::Distributions.Sampleable{<:Distributions.ArrayLikeVariate,Distributions.Continuous},
    n::Int,
)
    return @inbounds Distributions.rand!(
        rng, Distributions.sampler(s), CuArray{float(eltype(s))}(undef, length(s), n)
    )
end
Member Author

Usage of length here will cause some issues, e.g. what if s is wrapping a matrix distribution?

Maybe (undef, size(s)..., n) will do? But I don't quite recall what the correct size is here; it should be somewhere in the Distributions.jl docs.
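Something along these lines, perhaps (untested; assumes rand! accepts a (size(s)..., n)-shaped array for ArrayLikeVariate samplers, which is worth double-checking against the Distributions.jl docs):

function Distributions.rand(
    rng::CUDA.RNG,
    s::Distributions.Sampleable{<:Distributions.ArrayLikeVariate,Distributions.Continuous},
    n::Int,
)
    # Use size(s) instead of length(s) so that e.g. matrix-variate samplers get a
    # (size(s)..., n)-shaped output rather than a flattened length(s) × n one.
    return @inbounds Distributions.rand!(
        rng, Distributions.sampler(s), CuArray{float(eltype(s))}(undef, size(s)..., n)
    )
end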

Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Revise = "295af30f-e4ad-537b-8983-00126c2a3abe"
Zygote = "e88e6eb3-aa80-5325-afca-941959d7151f"
cuDNN = "02a925ec-e4fe-4b08-9a7e-0d78e3d38ccd"
Member Author

Is this needed now (after you removed the test-file you were using)?

Suggested change
cuDNN = "02a925ec-e4fe-4b08-9a7e-0d78e3d38ccd"

Member

This dependency is needed if we want some of the Flux.jl chains to run properly on GPU. But you are right, it's not used in the current examples---they are all running on CPU. I'll remove it later.

@torfjelde
Member Author

Honestly, IMO, the best solution right now is just to add our own rand to avoid ambiguity errors.

If we want to properly support all of this, we'll have to go down the path of specializing the methods further, i.e. not using a Union as we've done now, which will take time and effort.

For now, just make a NormalizingFlows.rand_device or something that calls rand by default, but which we can then overload to our liking without running into ambiguity errors.

How does that sound?
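Roughly, the indirection could look like this (a minimal sketch, using the name suggested above):

# Default: simply forward to rand. A CUDA extension can then overload this on
# CUDA.RNG without touching any methods owned by Distributions or Bijectors.
using Random

rand_device(rng::Random.AbstractRNG, args...) = rand(rng, args...)

# e.g. in ext/NormalizingFlowsCUDAExt.jl (hypothetical method):
# rand_device(rng::CUDA.RNG, s, n::Int) = ...  # GPU-specific sampling returning a CuArray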

@zuhengxu
Member

For now, just make a NormalizingFlows.rand_device or something that calls rand by default, but which we can then overload to our liking without running into ambiguity errors.

Yeah, after thinking about it, I agree that this is probably the best way to go at this point. Working on it now!

@zuhengxu
Member

zuhengxu commented Aug 22, 2023

I have adopted the NF.rand_device() approach. I think we now have a workaround. The following code runs properly:

using CUDA
using LinearAlgebra
using Distributions, Random
using Bijectors
using Flux
import NormalizingFlows as NF

rng = CUDA.default_rng()
T = Float32
q0_g = MvNormal(CUDA.zeros(T, 2), I)

CUDA.functional()
ts = reduce(∘, [f32(Bijectors.PlanarLayer(2)) for _ in 1:2])
ts_g = gpu(ts)
flow_g = transformed(q0_g, ts_g)

@torfjelde @sunxd3 Let me know if this attempt looks good to you. If so, I'll update the docs.

@yebai
Member

yebai commented Mar 5, 2025

@torfjelde @zuhengxu, can you try to finish this PR if nothing major stands in the way?

@sunxd3
Member

sunxd3 commented Mar 14, 2025

@yebai yebai requested review from penelopeysm and zuhengxu March 14, 2025 14:08
@penelopeysm
Member

I'm very sorry - I don't think I have the knowledge to review this. Happy to try to look at specific questions if you have any though.

@yebai
Member

yebai commented Mar 14, 2025

I'm very sorry - I don't think I have the knowledge to review this.

No worries, @penelopeysm. It might take you a while to become familiar with TuringLang libraries. In the meantime, please feel free to comment from the RSE perspective!

@Red-Portal
Member

Red-Portal commented Mar 15, 2025

Sorry for being late to the party; in my experience making AdvancedVI GPU-compatible, it was sufficient to simply swap the RNGs with the CUDA ones to make things work seamlessly. When this actually makes it into AdvancedVI, I will probably add a simple indirection for selecting which RNG gets supplied to rand. Here, on the other hand, was there any reason for defining CUDA-specific rand functions?


@testset "rand with CUDA" begin

# Bijectors versions use dot for broadcasting, which causes issues with CUDA.
Member

What's the status of GPU compatibility in Bijectors? Is there a list of bijectors that might cause issues with CUDA?

Member

I don't think anyone is very certain right now -- we need to do a sweep to tell.

generator is device specific (e.g. `CUDA.RNG`).
"""
function _device_specific_rand end

Member

The name _device_specific_rand is a bit of a mouthful for users; I think it's totally fine for internal usage. For the user API, would it be better to wrap it into iid_sample_reference(rng, dist, N) and iid_sample_flow(rng, flow, N)? And we can dispatch these two functions on the rng.

Doing this could also be beneficial if we want to relax the types of dist and flow (e.g., to adapt to Lux.jl).
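A rough sketch of what that could look like (names are only suggestions; the internal function is assumed to take (rng, dist, N)):

# Hypothetical user-facing wrappers; they forward to the internal function, which
# dispatches on the rng type, so a CUDA.RNG yields GPU samples and any other rng
# stays on the CPU.
iid_sample_reference(rng::Random.AbstractRNG, dist, N::Int) = _device_specific_rand(rng, dist, N)
iid_sample_flow(rng::Random.AbstractRNG, flow, N::Int) = _device_specific_rand(rng, flow, N)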

Member

@yebai yebai Mar 16, 2025

Let's take a look at what @Red-Portal suggested above. One might be able to use rand for GPUs, too: there are many improvements in the JuliaGPU ecosystem. rand usually assumes iid samples and is a much nicer API.

Member

Since we're only using Gaussians, randn should be more useful to be precise.

Member

One might be able to use rand for GPUs, too

As demonstrated in my comments below, I don't think we can simply use rand with the CUDA rng to deal with general reference distributions. If all we need is a standard Gaussian reference distribution, that's fine. But I'm not a huge fan of limiting what reference distributions users can use.

The reason I suggested the additional iid_sample_reference and iid_sample_flow is that we'd have an API that lets users write their own samplers on whatever device they want. What are your thoughts?

@zuhengxu
Member

zuhengxu commented Mar 17, 2025

Thank you for all the comments and feedback! Here are some of my thoughts.

Sorry for being late to the party; in my experience making AdvancedVI GPU-compatible, it was sufficient to simply swap the RNGs with the CUDA ones to make things work seamlessly. When this actually makes it into AdvancedVI, I will probably add a simple indirection for selecting which RNG gets supplied to rand. Here, on the other hand, was there any reason for defining CUDA-specific rand functions?

I don't think it works, or maybe I'm not doing it the right way. The following code doesn't work:

using Random, Distributions, LinearAlgebra
using CUDA

rng_g = CUDA.default_rng()
T = Float32
q0_g = MvNormal(CuArray(zeros(T, 2)), CuArray(ones(T, 2)))

rand(rng_g, q0_g, 10)

This returns a CPU array:

julia> rand(rng_g, q0_g, 10)
2×10 Matrix{Float32}:
 -0.322776   0.129745  1.08983   1.06144   -1.09644  -0.206741  0.273979   -0.518334  -1.35439    -1.79912
  0.259409  -0.583322  0.601871  0.571779   1.41544  -1.19601   0.0845693   0.766666  -0.0962037   0.0355074

That's why we need this _device_specific_rand function that can dispatch on the rng.
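For reference, the pattern is roughly the following (a sketch only, not necessarily the exact methods in this PR):

# CPU fallback: defer to the usual rand.
_device_specific_rand(rng::Random.AbstractRNG, d, n::Int) = rand(rng, d, n)

# In the CUDA extension: allocate the output on the GPU and fill it in place, so the
# result comes back as a CuArray (relying on a CuArray-specialized _rand!, as below).
function _device_specific_rand(rng::CUDA.RNG, d::Distributions.MvNormal, n::Int)
    x = CuArray{float(eltype(d))}(undef, length(d), n)
    return Distributions._rand!(rng, d, x)
end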

Since we're only using Gaussians, randn should be more useful to be precise.

I agree with this part. @sunxd3, maybe we can leverage the existing GPU compatibility of randn, e.g.,

julia> rng_g = CUDA.default_rng()
CUDA.RNG(0xd97bf5d1, 0x0000004a)

julia> T = Float64
Float64

julia> randn(rng_g, T, 2, 10)
2×10 CuArray{Float64, 2, CUDA.DeviceMemory}:
 -1.26503  -0.152555   0.921247  -1.05752  0.581864  -2.30691    0.26545   -1.1928     0.061966  -1.42031
 -1.81811  -0.393087  -1.85974   -1.39682  0.349741  -0.624886  -0.191652  -0.347422  -1.4505     1.74004

@Red-Portal
Member

Yeah, avoiding Distributions.jl like the plague is necessary.

@sunxd3
Member

sunxd3 commented Mar 17, 2025

yep, randn is used at

function Distributions._rand!(rng::CUDA.RNG, d::Distributions.MvNormal, x::CuVecOrMat)
    Random.randn!(rng, x)
    Distributions.unwhiten!(d.Σ, x)
    x .+= d.μ
    return x
end

The benchmark (https://gist.github.com/sunxd3/f680959b3f16e61521f1396db5c509bc) shows that sampling from MvNormal scales well, but sampling from flows does not yet.

@zuhengxu
Member

yep, randn is used at

function Distributions._rand!(rng::CUDA.RNG, d::Distributions.MvNormal, x::CuVecOrMat)
    Random.randn!(rng, x)
    Distributions.unwhiten!(d.Σ, x)
    x .+= d.μ
    return x
end

The benchmark (https://gist.github.com/sunxd3/f680959b3f16e61521f1396db5c509bc) shows that sampling from MvNormal scales well, but sampling from flows does not yet.

Yeah, overall the code under the hood looks great to me (thanks again @sunxd3!). My only concern is that the sampling function users get to call is _device_specific_rand, which is not very pleasing. I think wrapping this in a more intuitive name and dispatching on the rng type would be nicer.

Aside from this, since the sampling and logpdf functions work with CUDA, let's add a test of the whole pipeline to see if planar flow training and evaluation on GPU all work properly. For example, run this planar flow test on GPU.

@sunxd3
Member

sunxd3 commented Mar 18, 2025

My only concern is that the sampling function users get to call is _device_specific_rand, which is not very pleasing.

I agree; my thought is that for now we can keep _device_specific_rand strictly internal, and we can expose it when we have better GPU support.

Re: flow test, happy to add it later.

@Red-Portal
Member

My thought is that we don't have to generalize to allow any Distributions.jl distribution and can just use randn directly, no? I am aware that heavier-tailed base distributions have been proposed before; is that what we are trying to explicitly support?

@zuhengxu
Member

My thought is that we don't have to generalize to allow any Distributions.jl distribution and can just use randn directly, no? I am aware that heavier-tailed base distributions have been proposed before; is that what we are trying to explicitly support?

For now, randn is probably sufficient for most things we do. However, my taste is that an NF library should be as simple and flexible as just specifying the base distribution q0, the bijection B and its inverse (and maybe the loss function), and the rest should all be handled by the library. The fact that we are restricted to a Gaussian base distribution is not ideal.

@yebai
Member

yebai commented Mar 20, 2025

Let's introduce randn in a separate PR and merge this PR so we can move on.

@yebai yebai merged commit 34ed26e into main Mar 20, 2025
4 checks passed
@yebai yebai deleted the torfjelde/cuda branch March 20, 2025 21:18
@zuhengxu zuhengxu mentioned this pull request Jun 20, 2025