
@marius311
Owner

WIP towards using one-GPU-per-thread instead of one-GPU-per-process. The big advantage is that you don't need to launch multiple processes (which is slow and at least doubles the startup time), and memory can be shared between the GPUs using unified memory, instead of having to be serialized and distributed between separate processes (which necessarily passes through CPU memory).

Right now this works:

tmap(collect(devices())) do dev
    device!(dev)  # bind this thread's CUDA context to `dev`
    for i = 1:N
        gradient(ϕ -> norm(LenseFlow(ϕ) * f), ϕ)
    end
end

although MAP and sampling are still WIP.
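For illustration, the one-GPU-per-thread pattern above can be sketched with plain `Threads.@spawn` from Base. This is a minimal sketch, not the PR's implementation: `per_device_map` is a hypothetical helper standing in for `tmap`, and the work function is a placeholder.

```julia
using CUDA, LinearAlgebra

# Minimal sketch of one-GPU-per-thread: spawn one task per visible
# device, bind each task to its device with device!, then let each
# task do its work independently. No extra processes are launched,
# so startup cost is paid once and arrays need not be serialized
# through CPU memory to move between workers.
function per_device_map(f)
    tasks = map(collect(devices())) do dev
        Threads.@spawn begin
            device!(dev)  # task-local device binding in CUDA.jl
            f(dev)
        end
    end
    fetch.(tasks)  # collect one result per device
end

# Hypothetical usage: each task computes a norm on its own GPU.
# per_device_map(dev -> norm(CUDA.rand(1024)))
```

Note that `device!` affects the calling task's context, which is what makes binding one device per spawned task work here.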

Right now this needs CUDA 11.2 and my branch of CUDA.jl: https://github.com/marius311/CUDA.jl/tree/no_gc_ctx_switch

@marius311 marius311 marked this pull request as draft February 27, 2021 22:01