
Chapter 19, page 681 #240

@g-i-o-r-g-i-o


Bug Report: Incorrect return calculations for Episodes 1 and 2

Hi! First of all, thank you for this wonderful book; it's been an incredibly helpful resource for learning RL concepts. 🙏

I believe I've found calculation errors in the Episodic versus continuing tasks section, specifically in the return computations for Episode 1 and Episode 2. Both share the same root cause (an off-by-one error in the number of transitions), and Episode 2 also contains repeated subscript typos.

Location

Chapter 19 (Reinforcement Learning), section "RL terminology: return, policy, and value function", subsection "The return": computation of the return for Episodes 1 and 2.


Episode 1: BBCCCCBAT → pass (final reward = +1)

The issue

The episode has 8 non-terminal states:

$$S_0=B,\; S_1=B,\; S_2=C,\; S_3=C,\; S_4=C,\; S_5=C,\; S_6=B,\; S_7=A,\; S_8=\text{pass (terminal)}$$

So $T = 8$, with 8 transitions and 8 rewards ($R_1$ through $R_8$), where $R_1 = \cdots = R_7 = 0$ and $R_8 = +1$.

However, the book computes $G_0$ summing only up to $\gamma^6 R_7$ (7 rewards instead of 8):

$$G_0 = R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots + \gamma^6 R_7$$

It should be:

$$G_0 = R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots + \gamma^7 R_8 = 0.9^7 \approx 0.478$$
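
Because every reward before the terminal transition is zero in this episode, the sum collapses to a single term. As a sketch, assuming the book uses the standard definition of the discounted return with $T$ as the index of the terminal state:

$$G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1} = \gamma^{T-t-1} R_T \quad \text{when } R_1 = \cdots = R_{T-1} = 0$$

With $T = 8$, $R_8 = +1$, and $\gamma = 0.9$, this gives $G_0 = 0.9^{7}$ and $G_7 = 0.9^{0} = 1$, so the exponent at $t = 0$ is $T - 1$, not $T - 2$.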

What the book says vs. corrected values

| Time step | Book writes | Book value | Should be | Corrected value |
| --- | --- | --- | --- | --- |
| $t = 0$ | $G_0 = 0 + \cdots + 0.9^6$ | $0.531$ | $G_0 = 0.9^7$ | $0.478$ |
| $t = 1$ | $G_1 = 1 \times \gamma^5$ | $0.590$ | $G_1 = 0.9^6$ | $0.531$ |
| $t = 2$ | $G_2 = 1 \times \gamma^4$ | $0.656$ | $G_2 = 0.9^5$ | $0.590$ |
| $t = 3$ | (not shown) | - | $G_3 = 0.9^4$ | $0.656$ |
| $t = 4$ | (not shown) | - | $G_4 = 0.9^3$ | $0.729$ |
| $t = 5$ | (not shown) | - | $G_5 = 0.9^2$ | $0.810$ |
| $t = 6$ | $G_6 = 1 \times \gamma$ | $0.900$ | $G_6 = 0.9^1$ | $0.900$ ✓ |
| $t = 7$ | $G_7 = 1$ | $1.000$ | $G_7 = 1$ | $1.000$ ✓ |

Corrected calculation with backward recursion ($G_t = R_{t+1} + \gamma \, G_{t+1}$)

| Time step | Reward $R_{t+1}$ | Calculation | Corrected value |
| --- | --- | --- | --- |
| $G_7$ | $R_8 = +1$ | $1$ | $1.000$ |
| $G_6$ | $R_7 = 0$ | $0 + 0.9 \times 1.000$ | $0.900$ |
| $G_5$ | $R_6 = 0$ | $0 + 0.9 \times 0.900$ | $0.810$ |
| $G_4$ | $R_5 = 0$ | $0 + 0.9 \times 0.810$ | $0.729$ |
| $G_3$ | $R_4 = 0$ | $0 + 0.9 \times 0.729$ | $0.656$ |
| $G_2$ | $R_3 = 0$ | $0 + 0.9 \times 0.656$ | $0.590$ |
| $G_1$ | $R_2 = 0$ | $0 + 0.9 \times 0.590$ | $0.531$ |
| $G_0$ | $R_1 = 0$ | $0 + 0.9 \times 0.531$ | $0.478$ |
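
To double-check the recursion numerically, here is a minimal Python sketch. The function name `discounted_returns` and the list layout (index `t` holds $R_{t+1}$) are my own choices for illustration, not anything taken from the book's code.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] given rewards [R_1, ..., R_T]."""
    returns = [0.0] * len(rewards)
    g_next = 0.0  # G_T = 0: no reward follows the terminal state
    for t in reversed(range(len(rewards))):
        # rewards[t] holds R_{t+1}, so this is G_t = R_{t+1} + gamma * G_{t+1}
        g_next = rewards[t] + gamma * g_next
        returns[t] = g_next
    return returns


# Episode 1: eight transitions, only the last one is rewarded (+1).
episode1_rewards = [0, 0, 0, 0, 0, 0, 0, 1]
episode1_returns = discounted_returns(episode1_rewards, gamma=0.9)
for t, g in enumerate(episode1_returns):
    print(f"G_{t} = {g:.3f}")
```

Passing the ten rewards of Episode 2 (nine zeros followed by $-1$) to the same function should give the Episode 2 values in the same way.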

Episode 2: ABBBBBBBBBT → fail (final reward = −1)

The issue

The episode has 10 non-terminal states:

$$S_0=A,\; S_1=B,\; S_2=B,\; S_3=B,\; S_4=B,\; S_5=B,\; S_6=B,\; S_7=B,\; S_8=B,\; S_9=B,\; S_{10}=\text{fail (terminal)}$$

So $T = 10$, with 10 transitions and 10 rewards ($R_1$ through $R_{10}$), where $R_1 = \cdots = R_9 = 0$ and $R_{10} = -1$.

This episode has two types of errors:

Error 1: Off-by-one (same as Episode 1)

The book computes $G_0 = -1 \times \gamma^8 = -0.430$, but with 10 transitions it should be $G_0 = -1 \times \gamma^9 = -0.387$.
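
The following Python sketch illustrates how dropping a single transition reproduces the book's number. The helper `return_from_start` is a hypothetical name introduced only for this check, and it assumes all rewards except the last are zero, as in this episode.

```python
GAMMA = 0.9


def return_from_start(num_transitions, final_reward, gamma=GAMMA):
    """G_0 when only the last of `num_transitions` rewards is nonzero."""
    rewards = [0.0] * (num_transitions - 1) + [final_reward]
    # rewards[k] holds R_{k+1}, which is discounted by gamma**k in G_0
    return sum(gamma ** k * r for k, r in enumerate(rewards))


book_value = return_from_start(9, -1.0)        # one transition too few
corrected_value = return_from_start(10, -1.0)  # S_0 ... S_10 gives 10 transitions
print(f"{book_value:.3f} {corrected_value:.3f}")  # prints "-0.430 -0.387"
```

Counting nine transitions yields the printed $-0.430$, while the ten transitions of the actual episode yield $-0.387$.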

Error 2: Subscript typo (the book writes $G_0$ on almost every line)

The book writes $G_0$ on the lines for $t = 1$ and $t = 8$, where the subscript should match the time step ($G_t$), and writes $G_{10}$ on the final line, which is very confusing. Here is exactly what the book prints:

| Line in book | Book writes (verbatim) | Book value |
| --- | --- | --- |
| $t = 0$ | $G_0 = -1 \times \gamma^8$ | $-0.430$ |
| $t = 1$ | $G_0 = -1 \times \gamma^7$ | $-0.478$ |
| $t = 2$ | (not shown, implied by "...") | - |
| $t = 3$ | (not shown, implied by "...") | - |
| $t = 4$ | (not shown, implied by "...") | - |
| $t = 5$ | (not shown, implied by "...") | - |
| $t = 6$ | (not shown, implied by "...") | - |
| $t = 7$ | (not shown, implied by "...") | - |
| $t = 8$ | $G_0 = -1 \times \gamma$ | $-0.900$ |
| $t = 9$ | $G_{10} = -1$ | $-1.000$ |

As you can see, the subscript is wrong on three of the four lines shown:

  • At $t = 0$: writes $G_0$ → this one is actually correct
  • At $t = 1$: writes $G_0$ → should be $G_1$
  • At $t = 8$: writes $G_0$ → should be $G_8$
  • At $t = 9$: writes $G_{10}$ → should be $G_9$

What the book should say (all steps, corrected)

| Time step | Correct subscript | Calculation | Corrected value |
| --- | --- | --- | --- |
| $t = 0$ | $G_0 = -1 \times \gamma^9$ | $-0.9^9$ | $-0.387$ |
| $t = 1$ | $G_1 = -1 \times \gamma^8$ | $-0.9^8$ | $-0.430$ |
| $t = 2$ | $G_2 = -1 \times \gamma^7$ | $-0.9^7$ | $-0.478$ |
| $t = 3$ | $G_3 = -1 \times \gamma^6$ | $-0.9^6$ | $-0.531$ |
| $t = 4$ | $G_4 = -1 \times \gamma^5$ | $-0.9^5$ | $-0.590$ |
| $t = 5$ | $G_5 = -1 \times \gamma^4$ | $-0.9^4$ | $-0.656$ |
| $t = 6$ | $G_6 = -1 \times \gamma^3$ | $-0.9^3$ | $-0.729$ |
| $t = 7$ | $G_7 = -1 \times \gamma^2$ | $-0.9^2$ | $-0.810$ |
| $t = 8$ | $G_8 = -1 \times \gamma^1$ | $-0.9^1$ | $-0.900$ |
| $t = 9$ | $G_9 = -1$ | $-1$ | $-1.000$ |

Corrected calculation with backward recursion ($G_t = R_{t+1} + \gamma \, G_{t+1}$)

| Time step | Reward $R_{t+1}$ | Calculation | Corrected value |
| --- | --- | --- | --- |
| $G_9$ | $R_{10} = -1$ | $-1$ | $-1.000$ |
| $G_8$ | $R_9 = 0$ | $0 + 0.9 \times (-1.000)$ | $-0.900$ |
| $G_7$ | $R_8 = 0$ | $0 + 0.9 \times (-0.900)$ | $-0.810$ |
| $G_6$ | $R_7 = 0$ | $0 + 0.9 \times (-0.810)$ | $-0.729$ |
| $G_5$ | $R_6 = 0$ | $0 + 0.9 \times (-0.729)$ | $-0.656$ |
| $G_4$ | $R_5 = 0$ | $0 + 0.9 \times (-0.656)$ | $-0.590$ |
| $G_3$ | $R_4 = 0$ | $0 + 0.9 \times (-0.590)$ | $-0.531$ |
| $G_2$ | $R_3 = 0$ | $0 + 0.9 \times (-0.531)$ | $-0.478$ |
| $G_1$ | $R_2 = 0$ | $0 + 0.9 \times (-0.478)$ | $-0.430$ |
| $G_0$ | $R_1 = 0$ | $0 + 0.9 \times (-0.430)$ | $-0.387$ |

Summary

Both episodes share an off-by-one error: the calculations count one fewer transition than the episode actually has, which shifts the return values at the early time steps by one power of $\gamma$. The values printed next to the terminal state are correct, but those for the earlier time steps are off.

Episode 2 additionally has a subscript typo: the book writes $G_0$ on the lines for $t = 0, 1, 8$ where it should write $G_0, G_1, G_8$ respectively, and writes $G_{10}$ instead of $G_9$ at the final step.

Thank you again for your work on this book, and I hope this note is helpful!
