Hi,
Sorry in advance for the long post. The TL;DR is that I'm considering trying to add a threaded version of the `rollout` function available in the Python bindings, to speed up batched closed-loop simulations.

We're looking to use MuJoCo for RL on a single machine/single cluster node (i.e. not distributed), using TorchRL. I naively created a `DMControlEnv("humanoid", "stand")` in TorchRL, but noticed that stepping it was quite slow compared to just stepping the underlying MuJoCo model:

[benchmark plot]

(Here, the `torchrl.envs.ParallelEnv` wrapping in the bottom bar only uses a single worker to make the overhead apparent, but even scaling it to multiple workers still appears to come with quite a big overhead.)

I haven't dug into exactly what dm_control and TorchRL are doing that takes so much longer than just stepping the physics, but I imagine at least part of the issue is that the Python overhead for stepping the environment, calculating rewards, checking termination criteria, etc. is quite large compared to stepping a single MuJoCo simulation a single step. Does this make sense, or is there something else I'm missing?
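For reference, the per-step comparison above came from a simple timing harness of roughly this shape. The step functions below are dummy stand-ins so the snippet is self-contained; in the actual benchmark they were the raw `mujoco.mj_step(model, data)` call and the wrapped environment's `step`:

```python
import time

def seconds_per_step(step_fn, nsteps=1000):
    """Average wall-clock seconds per call to step_fn."""
    t0 = time.perf_counter()
    for _ in range(nsteps):
        step_fn()
    return (time.perf_counter() - t0) / nsteps

# Dummy stand-ins so this runs anywhere: in the real benchmark,
# raw_step was `lambda: mujoco.mj_step(model, data)` and
# env_step was `lambda: env.step(action)` on the wrapped environment.
raw_step = lambda: None
env_step = lambda: time.sleep(1e-5)  # simulated per-step wrapper overhead

print(f"raw physics step: {seconds_per_step(raw_step):.2e} s/step")
print(f"wrapped env step: {seconds_per_step(env_step):.2e} s/step")
```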
If that is the case, then a better implementation of the environment would be to somehow make a single batched call to MuJoCo to step multiple environments, and then do all the subsequent calculations with vectorized numpy/torch. I searched and found issues #203 and #897, and it seems that `mujoco.rollout.rollout` would probably be the best way to go, especially since it can even support some basic multithreading using Python's `ThreadPoolExecutor` without the GIL blocking too much. So, I did a naive wall-clock benchmark of this for different numbers of timesteps (the `nstep` argument in `rollout`) and numbers of parallel simulations (the `nroll` argument in `rollout`) and got this result:

[benchmark plot]

The second panel uses a `ThreadPoolExecutor` as in issue #897 and `rollout_test.py`. The script I used is here. As I'd expect, the speed-up is greatest at multiples of the number of cores (48; hence the stripes) and is otherwise generally larger with more parallel roll-outs. But it seems that for a small number of timesteps, the overhead remains quite significant even for 1024 parallel roll-outs. Plotting the same data using lines:

[same data plotted as lines]

My conclusion from this is that for open-loop roll-outs, Python threading is indeed quite efficient (as demonstrated in #897), but for closed-loop RL with intermediate batch sizes (say, 256 to 1024 parallel simulations and 1 to 5 physics steps per control step) there is considerable overhead, making this setup around 10x slower than open-loop roll-outs. Does that sound reasonable?
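For concreteness, the threaded dispatch I benchmarked follows the pattern from #897: split the batch of `nroll` initial states into per-thread chunks and submit each chunk to a `ThreadPoolExecutor`. Here is a sketch of that pattern with a dummy rollout function standing in for `mujoco.rollout.rollout`, so the snippet is self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def dummy_rollout(initial_states, nstep):
    """Stand-in for mujoco.rollout.rollout: returns (nroll, nstep, nstate) states.

    Here it just repeats the initial state; the real call steps the physics.
    """
    nroll, nstate = initial_states.shape
    return np.repeat(initial_states[:, None, :], nstep, axis=1)

def threaded_rollout(initial_states, nstep, nthread=4):
    """Split the batch into nthread chunks and roll each out on its own thread."""
    chunks = np.array_split(initial_states, nthread)
    with ThreadPoolExecutor(max_workers=nthread) as pool:
        results = pool.map(lambda chunk: dummy_rollout(chunk, nstep), chunks)
    return np.concatenate(list(results))

states = threaded_rollout(np.zeros((256, 10)), nstep=5)
print(states.shape)  # (256, 5, 10)
```

The real rollout releases the GIL while stepping the physics, which is why this pattern scales at all; the overhead I'm measuring is everything around those calls.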
If that is so, I'm thinking that for our use case there could still be a ~10x speed-up if the threading were done on the C++ side instead of in Python. So I'm thinking of trying to add a C++-threaded version of `rollout`, at least for our own use (and perhaps eventually making a PR if it works well), but I wanted to first hear whether that would make sense, or if anyone is working on something similar?

Thanks!