Replies: 13 comments
-
@ManifoldFR may I ask how you obtained the values for the model/policy hyperparams? Did you perform tuning using Optuna as in the RL zoo?
-
I started from the parameters of Jason Peng's code, but for things like the maximum grad norm, target KL, or vf coef I had to make guesses, because these were not parameters in his PPO implementation (he also used two separate optimizers for the policy and value functions).
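For reference, a minimal sketch of what passing those guessed knobs to stable-baselines3's PPO looks like. The numbers below are illustrative placeholders, not the values from my run, and the env id assumes the DeepMimic humanoid env registered by pybullet_envs:

```python
import gym
import pybullet_envs  # noqa: F401  (registers the DeepMimic bullet envs)
from stable_baselines3 import PPO

env = gym.make("HumanoidDeepMimicWalkBulletEnv-v1")  # assumed env id

# Placeholder values: these knobs do not exist in Jason Peng's PPO, so they
# had to be guessed when moving to SB3 (which also uses a single optimizer
# for both the policy and value heads).
model = PPO(
    "MlpPolicy",
    env,
    max_grad_norm=0.5,
    target_kl=0.02,
    vf_coef=0.5,
    verbose=1,
)
```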
-
Training sometimes gets stuck in such behavior. Did you try a couple of training runs?
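In case it helps, a quick sketch of what a couple of runs with different seeds could look like (env id assumed as in the snippet above; the timestep budget is arbitrary):

```python
import gym
import pybullet_envs  # noqa: F401
from stable_baselines3 import PPO

# Launch a few runs with different seeds to check whether the "stuck"
# behavior is just an unlucky initialization.
for seed in (0, 1, 2):
    env = gym.make("HumanoidDeepMimicWalkBulletEnv-v1")  # assumed env id
    model = PPO("MlpPolicy", env, seed=seed, verbose=0)
    model.learn(total_timesteps=2_000_000)
    model.save(f"deepmimic_ppo_seed{seed}")
```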
-
What about the discount factor and lambda parameter for TD(lambda)? Also, are you using my branch with the modifications to the Gym env?
-
@erwincoumans I tried a couple of runs using my script and another one using the training script from the stable-baselines3 zoo. @ManifoldFR I used the default values for the discount factor and lambda parameter. Did you use custom values? I wondered whether you also used the default ones, given that you didn't list them with the other params. I used the version with action/observation scaling, so I guess it's the same.
-
Sorry about that, I use a strategy where I have a default set of PPO params on top of SB3's defaults, and the values I gave you were the overrides for both of them. Check the hyperparams.yml in the Dropbox link I sent: I use the same discount and lambda (0.95) as Jason Peng. I think one of the important things was that I use 4096 timesteps per env per rollout.
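To make the layering concrete, here is a rough sketch; the dict-of-overrides structure is my own illustration rather than the actual hyperparams.yml, but the three values shown are the ones mentioned above:

```python
import gym
import pybullet_envs  # noqa: F401
from stable_baselines3 import PPO

# Everything not listed here falls back to SB3's PPO defaults.
overrides = {
    "gamma": 0.95,       # same discount as Jason Peng
    "gae_lambda": 0.95,  # TD(lambda) parameter
    "n_steps": 4096,     # timesteps collected per env per rollout
}

env = gym.make("HumanoidDeepMimicWalkBulletEnv-v1")  # assumed env id
model = PPO("MlpPolicy", env, verbose=1, **overrides)
```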
-
Ah I see, no worries!
-
I was wondering whether I was doing something wrong in the training setup or when loading the model, but I figured there might be something wrong with the parameters, given that training would get stuck.
-
Yes, the method is quite brittle I'm afraid; some hyperparameters can send you to very bad local minima. Have you looked at other papers like Facebook's ScaDiver? The approach is the same, but the subreward aggregation and early termination strategies are different. Maybe it's more robust, but I haven't tested it yet.
-
I haven't read the paper, but I saw their repo and video; it seems very promising. I am trying to stick with DeepMimic because I don't want to change everything halfway :) Also, if I recall correctly, they use a different format for clips (3D joints instead of quaternions, maybe?), so I would have to adapt the tracking algorithm to that as well.
-
They use the more standard BVH format instead of the custom format used in DeepMimic, and they have code to convert it to character poses in reduced coordinates to supply to PyBullet.
-
Btw I couldn't help but notice that in
I don't think it would actually make a huge difference, but it seemed a bit odd.
-
That's something I'm not 100% sure about. DeepMimic's interaction loop is pretty non-standard and it's hard to tell when the rewards are calculated; I think it's with respect to the current state. IMO either one works as long as you make sure the reference pose you're comparing the state to is the right one (same time step).
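Just to illustrate the point about time alignment (this is a toy sketch, not DeepMimic's actual reward code; `ref_motion.sample` is a hypothetical helper):

```python
import numpy as np

def imitation_reward(sim_pose, ref_motion, t):
    """Toy pose reward: compare the simulated pose to the reference pose
    sampled at the *same* time step, whichever state convention you use."""
    ref_pose = ref_motion.sample(t)          # hypothetical lookup at time t
    pose_err = float(np.sum((sim_pose - ref_pose) ** 2))
    return float(np.exp(-2.0 * pose_err))    # DeepMimic-style exp(-k * error)
```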
-
I tried to train the character using the hyperparams given by @ManifoldFR in #3076.
However, after 60 million steps the character averages a reward of ~300-350, and when I test it the character walks by always moving the same foot and then dragging the other one.
Here are my training and enjoy scripts:
train
enjoy
In deep_mimic_env.py I modified the action space by using a FakeBox class that inherits from gym.spaces.Box.
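The actual FakeBox lives in the attached scripts, but roughly speaking it is a thin subclass of gym.spaces.Box; purely as an assumption about its shape, something like:

```python
import numpy as np
import gym


class FakeBox(gym.spaces.Box):
    """Hypothetical sketch: a Box subclass exposing finite bounds so that
    SB3 treats the DeepMimic action space as a standard continuous Box."""

    def __init__(self, low, high):
        super().__init__(
            low=np.asarray(low, dtype=np.float32),
            high=np.asarray(high, dtype=np.float32),
            dtype=np.float32,
        )
```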