Description
I've read your paper and have been using the MuJoCo Playground to test my algorithms. Thank you for your great work. From the report, I see that the Brax framework was used for training and evaluation, with results reported across environments. I have two questions:
Were the Brax hyperparameters tuned separately for each environment? I noticed that the hyperparameters vary across environments, yet for the dm_control environments only two sets were shared: one for PPO and one for SAC.
Regarding the PPO agents: do the maximum achievable returns per episode vary significantly across environments? In some cases returns reach around 900–1000, but in others Brax seems to struggle; for example, HopperHop wasn't solved, FingerSpin reaches ~600, and PendulumSwingup only ~50. Is there a standard expected return for each environment (e.g., around 1000 for each dm_control task)? If so, could these differences be due to insufficient hyperparameter tuning, or are they expected?
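For context on why I'd expect roughly 1000: my understanding (an assumption on my part, not something stated in the report) is that dm_control suite tasks bound per-step rewards in [0, 1] and run episodes of 1000 control steps, which caps the per-episode return. A trivial back-of-envelope sketch:

```python
# Back-of-envelope cap on a dm_control episode return, assuming the suite's
# usual convention: per-step rewards in [0, 1], 1000 control steps per episode.
STEPS_PER_EPISODE = 1000
MAX_REWARD_PER_STEP = 1.0

max_return = STEPS_PER_EPISODE * MAX_REWARD_PER_STEP
print(max_return)  # → 1000.0
```

Under that assumption, ~50 on PendulumSwingup would mean the agent is far from the ceiling, which is why I'm asking whether this gap is a tuning issue or expected behavior.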