Questions about vmap on GPU and execution time scaling w.r.t. input size when only using the GPU #19618
-
Hello everyone! I've been messing around with JAX (in reinforcement learning) for around 6 months now, but my knowledge of the specifics is still quite limited when it comes to how JAX actually runs things under the hood. My question is about how JAX handles computations under `vmap` on GPU as the number of vectorized operations increases.

Here is my use case to clarify the question: I have been running experiments on GPU (GeForce GTX TITAN X), and when I increase the input size (the number of seeds I vmap over), execution time grows. I also observe that the share of GPU time spent accessing memory increases with input size (0% for 1 seed, ~3% for 10 seeds, ~30% for 100 seeds). So it could be that the code does not run exclusively on the GPU, and that the CPU-GPU dialogue that arises at larger input sizes slows down execution. Would there be a way to find out where in my code this could come from?

Here is the link to the repo if you have the time to dive in (the code is a bit complex and hard to reduce to a meaningful example in my case): https://github.com/YannBerthelot/jaxppo/tree/rnn

Thanks in advance for any help or tips on how to improve the performance of my approach! Do not hesitate to ask if you need more details.

(I believe my question is linked to this other one, #19103; however, in my case I believe the whole program runs on the GPU and not just a part of it, so there should be plenty of room for parallel optimization.)
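In case it's useful, here is roughly how I have been timing and tracing things. This is a minimal sketch, not my actual code: `train_step` is a made-up placeholder for my vmapped training function, and the trace directory is arbitrary:

```python
import jax
import jax.numpy as jnp

# Hypothetical stand-in for my real vmapped training step.
def train_step(seed):
    return jnp.sin(seed) ** 2

seeds = jnp.arange(100.0)  # e.g. 100 seeds vectorized with vmap

# Capture a trace to inspect in TensorBoard/Perfetto: it shows which ops
# run on the GPU and where host<->device transfers happen.
with jax.profiler.trace("/tmp/jax-trace"):
    out = jax.vmap(train_step)(seeds)
    out.block_until_ready()  # wait for async dispatch so the trace covers the work
```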
-
Hi - so I think the main misunderstanding here is that `vmap` doesn't explicitly have anything to do with parallelism: it just converts unbatched instructions to batched instructions. So a vmapped vector product is just a matrix product of the batched input, and a vmapped sum is just a sum along one axis of the batched input.

As for why large input sizes slow down your computation... well, the size of the computation in `vmap` scales linearly with the size of the input. From the times you quote, if I'm understanding correctly, the actual wall-time scaling is very sub-linear in the number of batches, meaning that for your program the compiler is making very good use of the hardware.

Does that help answer your question?
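To make the batching point concrete, here is a minimal sketch (the function and shapes are made up, not taken from your repo) showing that a vmapped dot product lowers to a single batched instruction rather than a loop:

```python
import jax
import jax.numpy as jnp

w = jnp.ones(4)  # fixed weight vector (hypothetical)

def dot(v):
    # unbatched: one vector-vector product
    return jnp.dot(v, w)

xs = jnp.ones((8, 4))  # a batch of 8 vectors

# The printed jaxpr contains a single dot_general over the whole batch:
# vmap rewrote the per-example dot into one batched matrix-vector product.
print(jax.make_jaxpr(jax.vmap(dot))(xs))
```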
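And if you want to check the wall-time scaling yourself, a rough measurement pattern (the function is the same toy example as above; the numbers will of course depend on your hardware) looks like:

```python
import time
import jax
import jax.numpy as jnp

w = jnp.ones(4)
f = jax.jit(jax.vmap(lambda v: jnp.dot(v, w)))

for n in (1, 10, 100):
    xs = jnp.ones((n, 4))
    f(xs).block_until_ready()  # warm-up: compile for this batch shape
    t0 = time.perf_counter()
    f(xs).block_until_ready()  # block so the async GPU work is included
    print(f"batch={n}: {time.perf_counter() - t0:.6f}s")
```

If those timings grow much more slowly than the batch size, the GPU still has idle capacity at small batches, which is consistent with what you are seeing.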