This repository was archived by the owner on Nov 17, 2023. It is now read-only.
@mxnet-label-bot add [modeling, gluon, question]
I am wondering about the inner mechanism of gradient calculation in this example. Here are some excerpts from model.py.

First, I know that in the actor-critic algorithm the policy advantage should be maximized while the state-value difference should be minimized, and the absolute values of the two are the same.

Now, my questions:
1. `log_policy` has to be multiplied by the advantage, and the negative of that product is then treated as `policy_loss` and passed to `autograd.backward`. But there is no such manipulation in the `train_step` function. Does that mean `log_policy` is multiplied implicitly?
2. For state-value estimation, an L1Loss or L2Loss is usually built and passed to `autograd.backward` in the Gluon version to minimize the difference. But a trick seems to be used here, as the comment says: `# NOTE(reed): The grads of values is actually negative advantages.` Similar to question 1, what exactly is calculated from `v_grads` and the corresponding net output `value` when back-propagating?

I know this question may be related to math, and I am sorry for my poor grasp of gradient calculation. I hope someone can give a clear answer, thanks.
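For what it's worth, here is a minimal sketch (plain Python, with hypothetical numbers, not the repo's actual model.py code) of why passing the negative advantage as the head gradient to `autograd.backward` is equivalent to building the two losses explicitly: the derivative of `-log_pi * A` with respect to `log_pi` is `-A`, and the derivative of `0.5 * (R - v)**2` with respect to `v` is also `-(R - v) = -A`.

```python
# Hypothetical scalars for one transition.
R = 2.0          # discounted return (assumed value for illustration)
v = 1.5          # critic's value estimate V(s)
log_pi = -0.7    # log-probability of the taken action

advantage = R - v  # A = R - V(s) = 0.5

# 1) Policy loss: L_pi = -log_pi * A, with A treated as a constant.
#    Analytically, dL_pi/dlog_pi = -A, so seeding backward with
#    -advantage as the head gradient of log_policy builds the same loss.
grad_log_pi_analytic = -advantage

# Check against a central finite difference.
eps = 1e-6
policy_loss = lambda lp: -lp * advantage
grad_log_pi_numeric = (policy_loss(log_pi + eps) - policy_loss(log_pi - eps)) / (2 * eps)

# 2) Value loss: L_v = 0.5 * (R - v)**2.
#    Analytically, dL_v/dv = -(R - v) = -advantage, which matches the
#    comment "The grads of values is actually negative advantages."
grad_v_analytic = -advantage
value_loss = lambda x: 0.5 * (R - x) ** 2
grad_v_numeric = (value_loss(v + eps) - value_loss(v - eps)) / (2 * eps)

print(grad_log_pi_analytic, grad_log_pi_numeric)  # both -0.5
print(grad_v_analytic, grad_v_numeric)            # both -0.5
```

So (if I understand it right) the code never needs to form the scalar losses at all: it back-propagates the hand-computed head gradients directly, and the chain rule does the rest.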