Replies: 2 comments
-
Rewards are sparse in our case, so I find it difficult to see how we can use this.
-
I imagined fine-tuning the language model to simply predict responses given the metadata after the prompt, but perhaps the sparse labels prevent forming a sum of future rewards over multiple responses.
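A minimal sketch of what I mean by return-conditioned fine-tuning (Decision Transformer style) with a sparse reward: if the reward only arrives after the final response, the return-to-go for every earlier turn is just that terminal reward, so the sum is still well defined. The `turns` structure and the `<|return|>` / `<|prompt|>` / `<|response|>` tags below are assumptions for illustration, not part of any particular library.

```python
from typing import List, Dict


def returns_to_go(rewards: List[float]) -> List[float]:
    """Suffix sums of the per-turn rewards (undiscounted)."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return list(reversed(rtg))


def build_training_examples(turns: List[Dict[str, str]],
                            rewards: List[float]) -> List[str]:
    """Prefix each (prompt, response) pair with its return-to-go so the LM
    learns p(response | return, metadata, prompt)."""
    rtg = returns_to_go(rewards)
    return [
        f"<|return|>{g:.2f}<|prompt|>{t['prompt']}<|response|>{t['response']}"
        for t, g in zip(turns, rtg)
    ]


if __name__ == "__main__":
    # Sparse case: only the last turn is labelled, earlier rewards are 0,
    # so every turn still gets a non-trivial return-to-go target.
    turns = [
        {"prompt": "How do I reset my password?", "response": "Go to settings..."},
        {"prompt": "That worked, thanks!", "response": "Glad to help."},
    ]
    rewards = [0.0, 1.0]  # terminal thumbs-up only
    for ex in build_training_examples(turns, rewards):
        print(ex)
```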
-
I would like to see whether the PPO method is better than the Decision Transformer method at learning to maximize reward.
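For reference, a minimal sketch of the PPO clipped objective being compared against, written in plain PyTorch rather than any particular RLHF library. The tensors `log_probs_new`, `log_probs_old`, and `advantages` are assumed to be per-token values from the current policy, the rollout policy, and a reward/value model; with a sparse reward the advantage could simply be the terminal reward minus a baseline, broadcast over the response tokens.

```python
import torch


def ppo_clip_loss(log_probs_new: torch.Tensor,
                  log_probs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate: maximize E[min(r*A, clip(r, 1-eps, 1+eps)*A)]."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negative sign because optimizers minimize.
    return -torch.min(unclipped, clipped).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy tensors standing in for token-level log-probs and advantages.
    lp_new = torch.randn(4, 16, requires_grad=True)
    lp_old = lp_new.detach() + 0.01 * torch.randn(4, 16)
    adv = torch.randn(4, 16)
    loss = ppo_clip_loss(lp_new, lp_old, adv)
    loss.backward()
    print(f"PPO clipped loss: {loss.item():.4f}")
```

The practical contrast is that PPO optimizes this objective online with rollouts, while the Decision Transformer approach is a purely supervised pass over return-conditioned sequences like the ones sketched in the earlier comment.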