last modifications

govin08 · govin08 · commit 012bf678b75e · 2025-09-20T00:15:07.000+09:00
diff --git a/_posts/2025-09-18-dp.md b/_posts/2025-09-18-dp.md
@@ -27,10 +27,9 @@ I wrote up Sutton's Chapter 3 fairly close to the book's content.
 
 # 4. Dynamic Programming
 
-## 4.1 Bellman optimality equation revisited
-
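-Let us also write down the Bellman optimality equation again.
-The optimality equations for the optimal values $v_\ast$ and $q_\ast$ are as follows.
+The very first thing that appears in Chapter 4 is the Bellman optimality equation.
+Since this chapter explains how to obtain an optimal policy using DP, it is natural that the following equations concerning the optimal policy are important.
+Written out again, the optimality equations for the optimal values $v_\ast$ and $q_\ast$ are as follows.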
 
 $$
 \begin{align*}
@@ -43,9 +42,9 @@ q_\ast(s,a)
 \end{align*}
 $$
 
-## 4.2 Bellman equation revisited
+## 4.1 Bellman equation revisited
 
-The Bellman equation for $v$,
+The Bellman equation for $v_\pi$ was written out in the previous post, but let us write it down again.
 
 $$
 \begin{align*}
@@ -57,7 +56,6 @@ v_\pi(s)
 \end{align*}
 $$
 
-is the first equation shown in Section 4.1 of the book.
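 The previous post proved that the first line equals the second and that both equal the fourth; that identity is what we called the Bellman equation.
 The third line, however, seems to come somewhat out of nowhere.
 Sure, it looks obviously true in meaning, but why it holds is not so easy to explain.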
@@ -67,18 +65,18 @@ $$
 
 $$
 \begin{align*}
-&\mathbb E\left[R_{t+1}+\gamma v_\pi(S_{t+1})|S_t=s\right]\\
+&\mathbb E_\pi\left[R_{t+1}+\gamma v_\pi(S_{t+1})|S_t=s\right]\\
 =&\sum_a\pi(a|s)\mathbb E\left[R_{t+1}+\gamma v_\pi(S_{t+1})|S_t=s,A_t=a\right]\\
-=&\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)\mathbb E\left[R_{t+1}+\gamma v_\pi(S_{t+1})|S_t=s,A_t=a,R_{t+1}=r,S_{t+1}=s'\right]\\
-=&\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)\mathbb E\left[r+\gamma v_\pi(S_{t+1})|S_{t+1}=s'\right]\\
-=&\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)\mathbb E\left[r+\gamma v_\pi(s')\right]\\
+=&\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)\mathbb E_\pi\left[R_{t+1}+\gamma v_\pi(S_{t+1})|S_t=s,A_t=a,R_{t+1}=r,S_{t+1}=s'\right]\\
+=&\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)\mathbb E_\pi\left[r+\gamma v_\pi(S_{t+1})|S_{t+1}=s'\right]\\
+=&\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)\mathbb E_\pi\left[r+\gamma v_\pi(s')\right]\\
 \end{align*}
 $$
 
-## 4.3 policy evaluation
+## 4.2 policy evaluation
 
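 Section 4.1 of the book deals with obtaining the value function $v_\pi$ for a given policy $\pi$.
-That is, it evaluates the policy (policy evaluation, the prediction problem), which is one of the two principal processes in all of reinforcement learning, DP included.
+That is, it evaluates the policy (policy evaluation, the prediction problem), which is one of the two important processes in all of reinforcement learning, DP included.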
 
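 The value function is obtained by rewriting equation (4.4) as
 
@@ -89,7 +87,7 @@ $$
 and thereby building up a sequence of value functions $v_0, v_1, v_2, \cdots$.
 It is known that the sequence $\\{v_i\\}$ converges to $v_\pi$ when $v_0$ is an arbitrary function (e.g. $v_0\equiv0$) and $v_\pi$ exists, and we will prove this.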
 
-## 4.4 Bellman operation
+## 4.3 Bellman operation
 
 The first step is to express the equation (4.4) version of the Bellman equation as a Bellman operation.
 I basically followed Carl Fredricksson's material.
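
As a rough illustration of treating the policy-evaluation update as an operation on value functions, here is a minimal Python sketch. Everything concrete in it is an assumption made for illustration and appears neither in the post nor in the book: the two-state, two-action MDP `P`, the uniform policy `PI`, the discount `GAMMA`, and the function names `bellman_operator` and `evaluate_policy`. The operation maps $v$ to the function whose value at $s$ is $\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)\left[r+\gamma v(s')\right]$, and applying it repeatedly from $v_0\equiv0$ produces the sequence $v_0, v_1, v_2, \cdots$.

```python
# Hypothetical 2-state, 2-action MDP (illustration only, not from the post or the book).
# P[s][a] is a list of (probability, next_state, reward) triples, i.e. p(s', r | s, a).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 2.0)], 1: [(1.0, 1, -1.0)]},
}
# PI[s][a] is pi(a | s); here an assumed uniformly random policy.
PI = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
GAMMA = 0.9  # assumed discount factor


def bellman_operator(v, P=P, pi=PI, gamma=GAMMA):
    """Apply the policy-evaluation update once to v and return the new value function.

    The value at s is sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v(s')].
    """
    return {
        s: sum(
            pi[s][a] * sum(prob * (r + gamma * v[s2]) for prob, s2, r in P[s][a])
            for a in P[s]
        )
        for s in P
    }


def evaluate_policy(tol=1e-10, max_iter=10_000):
    """Iterate v_{k+1} = bellman_operator(v_k) from v_0 = 0 until the sup-norm change is below tol."""
    v = {s: 0.0 for s in P}
    for _ in range(max_iter):
        v_next = bellman_operator(v)
        if max(abs(v_next[s] - v[s]) for s in P) < tol:
            return v_next
        v = v_next
    return v


if __name__ == "__main__":
    v_pi = evaluate_policy()
    print(v_pi)  # approximate v_pi for the assumed MDP and policy
```

The returned function is (approximately) a fixed point of `bellman_operator`, which is how $v_\pi$ is characterized by the Bellman equation; the tolerance and iteration cap are arbitrary choices for this sketch.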