There are two possible solutions to this problem.

The first solution was suggested by @YouJiacheng: store the states, actions, and returns, then evaluate log_prob as a function of params when computing the loss. Personally, I don't think you need to write REINFORCE as a class.

# collect states, actions and returns from the agent acting in the environment
def compute_loss(params):
    # params is the only argument, so the loss is differentiable w.r.t. it;
    # states, actions and returns are captured from the enclosing scope
    loss = 0.0
    for state, action, ret in zip(states, actions, returns):
        logits = policy_network.apply(params, state)
        log_prob = log_prob_func(logits, action)  # i.e. log_softmax(logits)[action]
        loss = loss + ret * log_prob
    return -loss  # negate: maximize expected return = minimize -loss
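To make the idea above concrete, here is a minimal runnable sketch. It replaces `policy_network.apply` with a hypothetical linear policy (a plain matrix multiply) so the example is self-contained; the names `states`, `actions`, `returns`, `compute_loss`, and `log_prob_func` follow the snippet, while `policy_apply` and the toy data are assumptions for illustration. The point is that because `compute_loss` is a pure function of `params`, `jax.grad` gives the policy gradient directly:

```python
import jax
import jax.numpy as jnp

# Hypothetical linear policy standing in for policy_network.apply:
# logits = params @ state, with one row of params per action.
def policy_apply(params, state):
    return params @ state

def log_prob_func(logits, action):
    # log-probability of the chosen action under a categorical policy
    return jax.nn.log_softmax(logits)[action]

def compute_loss(params, states, actions, returns):
    loss = 0.0
    for state, action, ret in zip(states, actions, returns):
        logits = policy_apply(params, state)
        loss = loss + ret * log_prob_func(logits, action)
    return -loss  # negate so that gradient descent maximizes return

# Toy rollout data: 2 actions, 3 state features (made up for the example)
params = jax.random.normal(jax.random.PRNGKey(0), (2, 3))
states = [jnp.ones(3), jnp.arange(3.0)]
actions = [0, 1]
returns = [1.0, 0.5]

# The REINFORCE gradient, via autodiff through the loss
grads = jax.grad(compute_loss)(params, states, actions, returns)
print(grads.shape)  # (2, 3), same shape as params
```

A plain Python loop is fine for a sketch; for real rollouts you would typically stack the data into arrays and use `jax.vmap` over the batch instead.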

Or

# collect states, actions and returns from agent in environme…

Answer selected by BalajiAI