Bug Report: Incorrect return calculations for Episodes 1 and 2
Hi! First of all, thank you for this wonderful book. It's been an incredibly helpful resource for learning RL concepts.
I believe I've found calculation errors in the Episodic versus continuing tasks section, specifically in the return computations for Episode 1 and Episode 2. Both share the same root cause (an off-by-one error in the number of transitions), and Episode 2 also contains repeated subscript typos.
Location
Chapter 19 (Reinforcement Learning), section "RL terminology: return, policy, and value function", subsection "The return": computation of the return for Episodes 1 and 2.
Episode 1: BBCCCCBAT → pass (final reward = +1)
The issue
The episode has 8 non-terminal states:
$$S_0=B,\; S_1=B,\; S_2=C,\; S_3=C,\; S_4=C,\; S_5=C,\; S_6=B,\; S_7=A,\; S_8=\text{pass (terminal)}$$
So $T = 8$, with 8 transitions and 8 rewards ($R_1$ through $R_8$), where $R_1 = \cdots = R_7 = 0$ and $R_8 = +1$.
However, the book computes $G_0$ summing only up to $\gamma^6 R_7$ (7 rewards instead of 8):
$$G_0 = R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots + \gamma^6 R_7$$
It should be:
$$G_0 = R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots + \gamma^7 R_8 = 0.9^7 \approx 0.478$$
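The same value follows from a closed form. As a sketch, assuming (as in this episode) that every reward before the terminal one is zero, only the last term of the sum survives:

$$G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1} = \gamma^{T-t-1} R_T$$

With $T = 8$, $\gamma = 0.9$, and $R_8 = +1$, this gives $G_0 = 0.9^{7} \approx 0.478$. The exponent 6 in the book's sum would correspond to $T = 7$, one transition too few.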
What the book says vs. corrected values
| Time step | Book writes | Book value | Should be | Corrected value |
|---|---|---|---|---|
| $t = 0$ | $G_0 = 0 + \cdots + 0.9^6$ | $0.531$ | $G_0 = 0.9^7$ | $0.478$ |
| $t = 1$ | $G_1 = 1 \times \gamma^5$ | $0.590$ | $G_1 = 0.9^6$ | $0.531$ |
| $t = 2$ | $G_2 = 1 \times \gamma^4$ | $0.656$ | $G_2 = 0.9^5$ | $0.590$ |
| $t = 3$ | (not shown) | n/a | $G_3 = 0.9^4$ | $0.656$ |
| $t = 4$ | (not shown) | n/a | $G_4 = 0.9^3$ | $0.729$ |
| $t = 5$ | (not shown) | n/a | $G_5 = 0.9^2$ | $0.810$ |
| $t = 6$ | $G_6 = 1 \times \gamma$ | $0.900$ | $G_6 = 0.9^1$ | $0.900$ ✓ |
| $t = 7$ | $G_7 = 1$ | $1.000$ | $G_7 = 1$ | $1.000$ ✓ |
Corrected calculation with backward recursion ($G_t = R_{t+1} + \gamma \, G_{t+1}$)
| Return | Reward $R_{t+1}$ | Calculation | Corrected value |
|---|---|---|---|
| $G_7$ | $R_8 = +1$ | $1$ | $1.000$ |
| $G_6$ | $R_7 = 0$ | $0 + 0.9 \times 1.000$ | $0.900$ |
| $G_5$ | $R_6 = 0$ | $0 + 0.9 \times 0.900$ | $0.810$ |
| $G_4$ | $R_5 = 0$ | $0 + 0.9 \times 0.810$ | $0.729$ |
| $G_3$ | $R_4 = 0$ | $0 + 0.9 \times 0.729$ | $0.656$ |
| $G_2$ | $R_3 = 0$ | $0 + 0.9 \times 0.656$ | $0.590$ |
| $G_1$ | $R_2 = 0$ | $0 + 0.9 \times 0.590$ | $0.531$ |
| $G_0$ | $R_1 = 0$ | $0 + 0.9 \times 0.531$ | $0.478$ |
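The recursion above is easy to check mechanically. Here is a minimal Python sketch; the function name `returns_backward` and the list layout (`rewards[t]` holds $R_{t+1}$) are my own illustrative choices, not code from the book:

```python
def returns_backward(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] given rewards = [R_1, ..., R_T]."""
    returns = [0.0] * len(rewards)
    g_next = 0.0  # G_T = 0: no reward follows the terminal state
    for t in reversed(range(len(rewards))):
        # rewards[t] holds R_{t+1}, so this is G_t = R_{t+1} + gamma * G_{t+1}
        g_next = rewards[t] + gamma * g_next
        returns[t] = g_next
    return returns


# Episode 1: R_1..R_7 = 0 and R_8 = +1, with gamma = 0.9
episode1_rewards = [0.0] * 7 + [1.0]
episode1_returns = returns_backward(episode1_rewards, gamma=0.9)
for t, g in enumerate(episode1_returns):
    print(f"G_{t} = {g:.3f}")
```

Running it should print the eight corrected values from the table, starting at $G_0 = 0.478$ and ending at $G_7 = 1.000$.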
Episode 2: ABBBBBBBBBT → fail (final reward = -1)
The issue
The episode has 10 non-terminal states:
$$S_0=A,\; S_1=B,\; S_2=B,\; S_3=B,\; S_4=B,\; S_5=B,\; S_6=B,\; S_7=B,\; S_8=B,\; S_9=B,\; S_{10}=\text{fail (terminal)}$$
So $T = 10$, with 10 transitions and 10 rewards ($R_1$ through $R_{10}$), where $R_1 = \cdots = R_9 = 0$ and $R_{10} = -1$.
This episode has two types of errors:
Error 1: Off-by-one (same as Episode 1)
The book computes $G_0 = -1 \times \gamma^8 = -0.430$, but with 10 transitions it should be $G_0 = -1 \times \gamma^9 = -0.387$.
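To illustrate where $-0.430$ could come from, here is a small forward-sum check. It is only a sketch: `discounted_return` is a hypothetical helper, and the "one transition short" list is my guess at how the book's value arises, not something the book states.

```python
def discounted_return(rewards, gamma, t=0):
    """Forward sum G_t = sum_k gamma**k * R_{t+k+1}, with rewards = [R_1, ..., R_T]."""
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))


gamma = 0.9
episode2_rewards = [0.0] * 9 + [-1.0]  # R_1..R_9 = 0, R_10 = -1

g0_corrected = discounted_return(episode2_rewards, gamma)
# Assumption: dropping one zero-reward transition mimics the off-by-one count
g0_one_short = discounted_return(episode2_rewards[1:], gamma)

print(f"10 transitions: G_0 = {g0_corrected:.3f}")
print(f" 9 transitions: G_0 = {g0_one_short:.3f}")
```

With all 10 rewards the sum gives $-0.387$; with only 9 it gives $-0.430$, which matches the value printed in the book.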
Error 2: Subscript typo (the book writes $G_0$ on almost every line)
The book labels almost every shown line $G_0$ instead of $G_t$, which is very confusing. Here is exactly what the book prints:
| Line in book | Book writes (verbatim) | Book value |
|---|---|---|
| $t = 0$ | $G_0 = -1 \times \gamma^8$ | $-0.430$ |
| $t = 1$ | $G_0 = -1 \times \gamma^7$ | $-0.478$ |
| $t = 2$ | (not shown, implied by "...") | n/a |
| $t = 3$ | (not shown, implied by "...") | n/a |
| $t = 4$ | (not shown, implied by "...") | n/a |
| $t = 5$ | (not shown, implied by "...") | n/a |
| $t = 6$ | (not shown, implied by "...") | n/a |
| $t = 7$ | (not shown, implied by "...") | n/a |
| $t = 8$ | $G_0 = -1 \times \gamma$ | $-0.900$ |
| $t = 9$ | $G_{10} = -1$ | $-1.000$ |
As you can see, the subscript is wrong on three of the four shown lines:
- At $t = 0$: writes $G_0$, which is actually correct
- At $t = 1$: writes $G_0$ → should be $G_1$
- At $t = 8$: writes $G_0$ → should be $G_8$
- At $t = 9$: writes $G_{10}$ → should be $G_9$
What the book should say (all steps, corrected)
| Time step | Corrected expression | Calculation | Corrected value |
|---|---|---|---|
| $t = 0$ | $G_0 = -1 \times \gamma^9$ | $-0.9^9$ | $-0.387$ |
| $t = 1$ | $G_1 = -1 \times \gamma^8$ | $-0.9^8$ | $-0.430$ |
| $t = 2$ | $G_2 = -1 \times \gamma^7$ | $-0.9^7$ | $-0.478$ |
| $t = 3$ | $G_3 = -1 \times \gamma^6$ | $-0.9^6$ | $-0.531$ |
| $t = 4$ | $G_4 = -1 \times \gamma^5$ | $-0.9^5$ | $-0.590$ |
| $t = 5$ | $G_5 = -1 \times \gamma^4$ | $-0.9^4$ | $-0.656$ |
| $t = 6$ | $G_6 = -1 \times \gamma^3$ | $-0.9^3$ | $-0.729$ |
| $t = 7$ | $G_7 = -1 \times \gamma^2$ | $-0.9^2$ | $-0.810$ |
| $t = 8$ | $G_8 = -1 \times \gamma^1$ | $-0.9^1$ | $-0.900$ |
| $t = 9$ | $G_9 = -1$ | $-1$ | $-1.000$ |
Corrected calculation with backward recursion ($G_t = R_{t+1} + \gamma \, G_{t+1}$)
| Return | Reward $R_{t+1}$ | Calculation | Corrected value |
|---|---|---|---|
| $G_9$ | $R_{10} = -1$ | $-1$ | $-1.000$ |
| $G_8$ | $R_9 = 0$ | $0 + 0.9 \times (-1.000)$ | $-0.900$ |
| $G_7$ | $R_8 = 0$ | $0 + 0.9 \times (-0.900)$ | $-0.810$ |
| $G_6$ | $R_7 = 0$ | $0 + 0.9 \times (-0.810)$ | $-0.729$ |
| $G_5$ | $R_6 = 0$ | $0 + 0.9 \times (-0.729)$ | $-0.656$ |
| $G_4$ | $R_5 = 0$ | $0 + 0.9 \times (-0.656)$ | $-0.590$ |
| $G_3$ | $R_4 = 0$ | $0 + 0.9 \times (-0.590)$ | $-0.531$ |
| $G_2$ | $R_3 = 0$ | $0 + 0.9 \times (-0.531)$ | $-0.478$ |
| $G_1$ | $R_2 = 0$ | $0 + 0.9 \times (-0.478)$ | $-0.430$ |
| $G_0$ | $R_1 = 0$ | $0 + 0.9 \times (-0.430)$ | $-0.387$ |
Summary
Both episodes share an off-by-one error: the calculations count one fewer transition than the episode actually has, shifting all return values by one power of $\gamma$. The values near the terminal state are correct, but the earlier time steps are all off.
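The "one power of $\gamma$" claim can be spot-checked against the book values quoted in this report. The sketch below assumes $\gamma = 0.9$ and uses a hypothetical helper, `corrected_return`, that relies on the terminal reward being the only nonzero one:

```python
gamma = 0.9


def corrected_return(t, T, final_reward):
    # Only the terminal reward R_T is nonzero, so G_t = gamma**(T - t - 1) * R_T
    return gamma ** (T - t - 1) * final_reward


# (T, final reward, book values at early time steps as quoted in this report)
book_values = {
    "Episode 1": (8, +1.0, {0: 0.531, 1: 0.590, 2: 0.656}),
    "Episode 2": (10, -1.0, {0: -0.430, 1: -0.478}),
}

for name, (T, final_reward, shown) in book_values.items():
    for t, book_g in shown.items():
        fixed = corrected_return(t, T, final_reward)
        # If the book is one transition short, book_g * gamma should match the fix
        print(f"{name}, t={t}: book {book_g:+.3f}, corrected {fixed:+.3f}, "
              f"book * gamma = {book_g * gamma:+.3f}")
```

For each early time step, the book value multiplied by $\gamma$ agrees with the corrected value to within the rounding of the three printed decimals.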
Episode 2 additionally has a subscript typo: on the shown lines for $t = 0, 1, 8$ the book writes $G_0$ every time instead of $G_0, G_1, G_8$, and at the final step ($t = 9$) it writes $G_{10}$ instead of $G_9$.
Thank you again for your work on this book, and I hope this note is helpful!