Replies: 2 comments
-
Rewards are sparse in our case, so I find it difficult to see how we can use this.
-
I imagined fine-tuning the language model to simply predict responses given the metadata after the prompt, but perhaps the sparse labels prevent forming a sum of future rewards over multiple responses.
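A minimal sketch of what I mean by return-conditioned fine-tuning (Decision Transformer style) with a sparse reward: if the reward only arrives after the final response, the return-to-go for every earlier turn is just that terminal reward, so the sum is still well defined. The `turns` structure and the `<|return|>` / `<|prompt|>` / `<|response|>` tags below are assumptions for illustration, not part of any particular library.

```python
from typing import List, Dict


def returns_to_go(rewards: List[float]) -> List[float]:
    """Suffix sums of the per-turn rewards (undiscounted)."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return list(reversed(rtg))


def build_training_examples(turns: List[Dict[str, str]],
                            rewards: List[float]) -> List[str]:
    """Prefix each (prompt, response) pair with its return-to-go so the LM
    learns p(response | return, metadata, prompt)."""
    rtg = returns_to_go(rewards)
    return [
        f"<|return|>{g:.2f}<|prompt|>{t['prompt']}<|response|>{t['response']}"
        for t, g in zip(turns, rtg)
    ]


if __name__ == "__main__":
    # Sparse case: only the last turn is labelled, earlier rewards are 0,
    # so every turn still gets a non-trivial return-to-go target.
    turns = [
        {"prompt": "How do I reset my password?", "response": "Go to settings..."},
        {"prompt": "That worked, thanks!", "response": "Glad to help."},
    ]
    rewards = [0.0, 1.0]  # terminal thumbs-up only
    for ex in build_training_examples(turns, rewards):
        print(ex)
```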
-
I would like to see whether the PPO method is better than the Decision Transformer method at learning to maximize reward.
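For reference, a minimal sketch of the PPO clipped objective being compared against, written in plain PyTorch rather than any particular RLHF library. The tensors `log_probs_new`, `log_probs_old`, and `advantages` are assumed to be per-token values from the current policy, the rollout policy, and a reward/value model; with a sparse reward the advantage could simply be the terminal reward minus a baseline, broadcast over the response tokens.

```python
import torch


def ppo_clip_loss(log_probs_new: torch.Tensor,
                  log_probs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate: maximize E[min(r*A, clip(r, 1-eps, 1+eps)*A)]."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negative sign because optimizers minimize.
    return -torch.min(unclipped, clipped).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy tensors standing in for token-level log-probs and advantages.
    lp_new = torch.randn(4, 16, requires_grad=True)
    lp_old = lp_new.detach() + 0.01 * torch.randn(4, 16)
    adv = torch.randn(4, 16)
    loss = ppo_clip_loss(lp_new, lp_old, adv)
    loss.backward()
    print(f"PPO clipped loss: {loss.item():.4f}")
```

The practical contrast is that PPO optimizes this objective online with rollouts, while the Decision Transformer approach is a purely supervised pass over return-conditioned sequences like the ones sketched in the earlier comment.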