-
I've implemented the REINFORCE algorithm in PyTorch and planned to port it to JAX/Flax. While doing so, I got stuck on a problem. I'll attach both my PyTorch implementation and my partially completed JAX/Flax version below. Please share a JAX version of the REINFORCE algorithm if you have come across one.
-
You can store the collected states, actions and returns and accumulate the loss in a Python loop:

```python
# collect states, actions and returns from agent in environment
def compute_loss(params):
    loss = 0
    for state, action, ret in zip(states, actions, returns):
        logits = policy_network.apply(params, state)
        log_prob = log_prob_func(logits, action)  # i.e. log_softmax(logits)[action]
        loss = loss + ret * log_prob
    return -loss
```

Or stack the collected data into arrays and vectorize with `jax.vmap`:

```python
# collect states, actions and returns from agent in environment, and stack to array
def compute_loss(params):
    logits = jax.vmap(lambda s: policy_network.apply(params, s))(states)
    log_probs = jax.vmap(log_prob_func)(logits, actions)
    return jnp.sum(-returns * log_probs)
```
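Both formulations compute the same scalar loss. A self-contained check with a toy linear policy (the policy, `log_prob_func`, and all data below are stand-ins I made up for illustration, not the asker's network):

```python
import jax
import jax.numpy as jnp

def apply_policy(params, state):
    return state @ params  # toy linear policy: logits = state . W

def log_prob_func(logits, action):
    return jax.nn.log_softmax(logits)[action]

def loss_loop(params, states, actions, returns):
    # Python-loop variant: accumulate ret * log_prob one transition at a time
    loss = 0.0
    for state, action, ret in zip(states, actions, returns):
        loss = loss + ret * log_prob_func(apply_policy(params, state), action)
    return -loss

def loss_vmap(params, states, actions, returns):
    # vectorized variant: one vmapped forward pass over the stacked states
    logits = jax.vmap(lambda s: apply_policy(params, s))(states)
    log_probs = jax.vmap(log_prob_func)(logits, actions)
    return jnp.sum(-returns * log_probs)

params = jax.random.normal(jax.random.PRNGKey(0), (4, 2))   # 4-dim states, 2 actions
states = jax.random.normal(jax.random.PRNGKey(1), (5, 4))   # 5 collected states
actions = jnp.array([0, 1, 0, 1, 1])
returns = jnp.array([1.0, 0.9, 0.8, 0.7, 0.6])

print(jnp.allclose(loss_loop(params, states, actions, returns),
                   loss_vmap(params, states, actions, returns)))  # True
```

Either variant can then be differentiated with `jax.grad` with respect to `params`.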
-
Since REINFORCE is an on-policy algorithm, it throws away the collected trajectories after updating the policy (which happens at the end of each episode), so the rollout can live inside the loss function:

```python
def compute_loss(env, policy, params, nb_timesteps, discount_factor=0.99):
    state = env.reset()
    rewards = []
    log_probs = []
    for timestep in range(nb_timesteps):
        # sample an action and compute the log-probability of that action
        action, log_prob = act(policy, params, state)
        state, reward, done, info = env.step(action)
        log_probs.append(log_prob)
        rewards.append(reward)
        if done:
            break
    # calculate returns using rewards & discount factor
    returns = []
    G = 0.0
    for reward in reversed(rewards):
        G = reward + discount_factor * G
        returns.insert(0, G)
    loss = jnp.sum(jnp.asarray(log_probs) * jnp.asarray(returns))
    return -loss
```

Then, use
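The sentence above is cut off in the thread; presumably it continues with differentiating the loss and updating the parameters. A minimal sketch of that update pattern (a toy quadratic loss stands in for the REINFORCE loss here, and all names are illustrative assumptions, not from the thread):

```python
import jax
import jax.numpy as jnp

# Stand-in loss: any scalar-valued function of a params pytree works the
# same way with jax.grad as the REINFORCE loss above.
def compute_loss(params):
    return jnp.sum(params["w"] ** 2) + jnp.sum(params["b"] ** 2)

params = {"w": jnp.array([1.0, -2.0]), "b": jnp.array([0.5])}
learning_rate = 0.1

# One plain-SGD step: differentiate, then update every leaf of the pytree.
grads = jax.grad(compute_loss)(params)
params = jax.tree_util.tree_map(lambda p, g: p - learning_rate * g, params, grads)

print(params["w"])  # [0.8, -1.6]: each leaf moved against its gradient
```

In practice an optimizer library such as optax replaces the hand-written `tree_map` update, but the grad-then-update structure is the same.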
-
There are two possible solutions to this problem. The first solution was suggested by @YouJiacheng.
The second solution was suggested by me (@BalajiAI); it is more efficient than the first in terms of both time and memory, since we don't have to recompute the outputs (logits) for every state using the neural network.