[Breaking Change v10] Nstep no longer returns "discounts" #8

ymd-h · 2021-03-08T15:06:42Z

ymd-h
Mar 8, 2021
Maintainer

We have just released cpprb v10. (Binaries will be uploaded to PyPI soon.)
`ReplayBuffer` with Nstep no longer returns "discounts", because users can always multiply by the fixed factor `gamma ** nstep` themselves.

For example, given a terminated trajectory $s_0, s_1, s_2, s_3, s_4$, the 3-step targets $G_t^3$ and the corresponding done flags $d_t^3$ are as follows:

$\begin{align*} G_0^3 &= r_0 + \gamma r_1 + \gamma^2r_2 + \gamma^3 \max_a Q(s_{3},a) & d_0^3 = 0 \cr G_1^3 &= r_1 + \gamma r_2 + \gamma^2r_3 + \gamma^3 \max_a Q(s_{4},a) & d_1^3 = 0 \cr G_2^3 &= r_2 + \gamma r_3 + \gamma^2r_4 & d_2^3 = 1 \cr G_3^3 &= r_3 + \gamma r_4 & d_3^3 = 1 \cr G_4^3 &= r_4 & d_4^3 = 1 \end{align*}$
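To make the table above concrete, here is a minimal sketch that recomputes the discounted reward sums and done flags for one terminated episode. It is plain NumPy and does not reflect how cpprb implements Nstep internally; the function name `nstep_rewards_and_dones` is hypothetical and only for illustration.

```python
import numpy as np


def nstep_rewards_and_dones(rewards, nstep, gamma):
    """Hypothetical helper: n-step reward sums and done flags for one terminated episode."""
    T = len(rewards)
    sums = np.zeros(T)
    dones = np.zeros(T)
    for t in range(T):
        # The window is cut short when it would run past the end of the episode.
        end = min(t + nstep, T)
        sums[t] = sum(gamma ** k * rewards[t + k] for k in range(end - t))
        # done is 1 as soon as the window reaches the terminal step.
        dones[t] = 1.0 if t + nstep >= T else 0.0
    return sums, dones


# Five rewards r_0..r_4, matching the 3-step example above.
sums, dones = nstep_rewards_and_dones([1.0, 1.0, 1.0, 1.0, 1.0], nstep=3, gamma=0.5)
print(sums)
print(dones)
```

With five rewards and `nstep=3`, the done flags come out as 0, 0, 1, 1, 1, which matches $d_0^3, \dots, d_4^3$ above.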

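As long as `done` is calculated correctly, `sample["rew"] + (gamma ** nstep) * (1 - sample["done"]) * Q(sample["next_obs"]).max(axis=1)` is fine. Whenever the n-step window reaches the end of the episode, `done` is 1 and the bootstrap term vanishes, so the discount exponent never needs to differ from `nstep`.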
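A minimal sketch of that target computation, assuming `sample` is the dict returned by sampling and `q_next` is the array `Q(sample["next_obs"])` with shape `(batch, n_actions)`. The function name `nstep_td_target` is hypothetical. The sketch also assumes `rew` and `done` may arrive with shape `(batch, 1)`, so it flattens them first to avoid broadcasting against the `(batch,)` result of `max(axis=1)` into a `(batch, batch)` matrix.

```python
import numpy as np


def nstep_td_target(sample, q_next, gamma, nstep):
    """Hypothetical helper: n-step TD target with a fixed gamma ** nstep discount."""
    # Flatten to (batch,) so the element-wise product below stays one-dimensional.
    rew = np.ravel(sample["rew"])
    done = np.ravel(sample["done"])
    # The bootstrap term is dropped wherever done == 1.
    return rew + (gamma ** nstep) * (1.0 - done) * q_next.max(axis=1)


# Toy batch of two transitions; the second one is terminal.
sample = {
    "rew": np.array([[1.0], [2.0]]),
    "done": np.array([[0.0], [1.0]]),
}
q_next = np.array([[1.0, 3.0], [5.0, 4.0]])  # stand-in for Q(sample["next_obs"])
target = nstep_td_target(sample, q_next, gamma=0.5, nstep=3)
print(target)
```

For the terminal transition the target is just the accumulated reward, regardless of `gamma ** nstep`.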

If you have any questions, please feel free to ask me.

Ref: https://gitlab.com/ymd_h/cpprb/-/issues/137
Ref: https://ymd_h.gitlab.io/cpprb/features/nstep/
