|
37 | 37 | "id": "p62G8M_viUJp"
|
38 | 38 | },
|
39 | 39 | "source": [
|
40 |
| - "# Playing CartPole with the Actor-Critic Method\n" |
| 40 | + "# Playing CartPole with the Actor-Critic method\n" |
41 | 41 | ]
|
42 | 42 | },
|
43 | 43 | {
|
|
74 | 74 | "id": "kFgN7h_wiUJq"
|
75 | 75 | },
|
76 | 76 | "source": [
|
77 |
| - "This tutorial demonstrates how to implement the [Actor-Critic](https://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf) method using TensorFlow to train an agent on the [Open AI Gym](https://gym.openai.com/) CartPole-V0 environment.\n", |
78 |
| - "The reader is assumed to have some familiarity with [policy gradient methods](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf) of reinforcement learning. \n" |
| 77 | + "This tutorial demonstrates how to implement the [Actor-Critic](https://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf) method using TensorFlow to train an agent on the [Open AI Gym](https://gym.openai.com/) [`CartPole-v0`](https://www.gymlibrary.dev/environments/classic_control/cart_pole/) environment.\n", |
| 78 | + "The reader is assumed to have some familiarity with [policy gradient methods](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf) of [(deep) reinforcement learning](https://en.wikipedia.org/wiki/Deep_reinforcement_learning). \n" |
79 | 79 | ]
|
80 | 80 | },
|
81 | 81 | {
|
|
86 | 86 | "source": [
|
87 | 87 | "**Actor-Critic methods**\n",
|
88 | 88 | "\n",
|
89 |
| - "Actor-Critic methods are [temporal difference (TD) learning](https://en.wikipedia.org/wiki/Temporal_difference_learning) methods that represent the policy function independent of the value function. \n", |
| 89 | + "Actor-Critic methods are [temporal difference (TD) learning](https://en.wikipedia.org/wiki/Temporal_difference_learning) methods that represent the policy function independent of the value function.\n", |
90 | 90 | "\n",
|
91 | 91 | "A policy function (or policy) returns a probability distribution over actions that the agent can take based on the given state.\n",
|
92 | 92 | "A value function determines the expected return for an agent starting at a given state and acting according to a particular policy forever after.\n",
|
|
102 | 102 | "id": "rBfiafKSRs2k"
|
103 | 103 | },
|
104 | 104 | "source": [
|
105 |
| - "**CartPole-v0**\n", |
| 105 | + "**`CartPole-v0`**\n", |
106 | 106 | "\n",
|
107 |
| - "In the [CartPole-v0 environment](https://www.gymlibrary.dev/environments/classic_control/cart_pole/), a pole is attached to a cart moving along a frictionless track. \n", |
108 |
| - "The pole starts upright and the goal of the agent is to prevent it from falling over by applying a force of -1 or +1 to the cart. \n", |
109 |
| - "A reward of +1 is given for every time step the pole remains upright.\n", |
110 |
| - "An episode ends when (1) the pole is more than 15 degrees from vertical or (2) the cart moves more than 2.4 units from the center.\n", |
| 107 | + "In the [`CartPole-v0` environment](https://www.gymlibrary.dev/environments/classic_control/cart_pole/), a pole is attached to a cart moving along a frictionless track.\n", |
| 108 | + "The pole starts upright and the goal of the agent is to prevent it from falling over by applying a force of `-1` or `+1` to the cart.\n", |
| 109 | + "A reward of `+1` is given for every time step the pole remains upright.\n", |
| 110 | + "An episode ends when: 1) the pole is more than 15 degrees from vertical; or 2) the cart moves more than 2.4 units from the center.\n", |
111 | 111 | "\n",
|
112 | 112 | "<center>\n",
|
113 | 113 | " <figure>\n",
|
|
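The environment described above can be created and probed directly with Gym. The snippet below is a minimal sketch; the exact return signatures of `reset` and `step` depend on the installed Gym version (the classic 4-tuple API is assumed here).

```python
import gym

# Create the CartPole-v0 environment described above.
env = gym.make("CartPole-v0")

# The state holds four values (cart position, cart velocity, pole angle,
# pole angular velocity); the two discrete actions push the cart left (0) or right (1).
print(env.observation_space)
print(env.action_space)

# A reward of +1 is returned for every step the pole stays upright; `done`
# becomes True once the pole tips too far or the cart leaves the track limits.
state = env.reset()
next_state, reward, done, info = env.step(env.action_space.sample())
```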
203 | 203 | "id": "AOUCe2D0iUJu"
|
204 | 204 | },
|
205 | 205 | "source": [
|
206 |
| - "## Model\n", |
| 206 | + "## The model\n", |
207 | 207 | "\n",
|
208 |
| - "The *Actor* and *Critic* will be modeled using one neural network that generates the action probabilities and critic value respectively. This tutorial uses model subclassing to define the model. \n", |
| 208 | + "The *Actor* and *Critic* will be modeled using one neural network that generates the action probabilities and Critic value respectively. This tutorial uses model subclassing to define the model. \n", |
209 | 209 | "\n",
|
210 | 210 | "During the forward pass, the model will take in the state as the input and will output both action probabilities and critic value $V$, which models the state-dependent [value function](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#value-functions). The goal is to train a model that chooses actions based on a policy $\\pi$ that maximizes expected [return](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#reward-and-return).\n",
|
211 | 211 | "\n",
|
212 |
| - "For Cartpole-v0, there are four values representing the state: cart position, cart-velocity, pole angle and pole velocity respectively. The agent can take two actions to push the cart left (0) and right (1) respectively.\n", |
| 212 | + "For `CartPole-v0`, there are four values representing the state: cart position, cart-velocity, pole angle and pole velocity respectively. The agent can take two actions to push the cart left (`0`) and right (`1`), respectively.\n", |
213 | 213 | "\n",
|
214 |
| - "Refer to [OpenAI Gym's CartPole-v0 wiki page](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf) for more information.\n" |
| 214 | + "Refer to [Gym's Cart Pole documentation page](https://www.gymlibrary.dev/environments/classic_control/cart_pole/) and ["Neuronlike adaptive elements that can solve difficult learning control problems"](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf) by Barto, Sutton and Anderson (1983) for more information.\n" |
215 | 215 | ]
|
216 | 216 | },
|
217 | 217 | {
|
|
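A minimal sketch of such a subclassed model is shown below; the layer sizes and variable names are illustrative assumptions rather than fixed choices.

```python
import tensorflow as tf
from tensorflow.keras import layers
from typing import Tuple

class ActorCritic(tf.keras.Model):
  """Combined Actor-Critic network with a shared hidden layer."""

  def __init__(self, num_actions: int, num_hidden_units: int):
    super().__init__()
    # Shared representation of the state.
    self.common = layers.Dense(num_hidden_units, activation="relu")
    # Actor head: unnormalized log-probabilities (logits) over the actions.
    self.actor = layers.Dense(num_actions)
    # Critic head: scalar state-value estimate V(s).
    self.critic = layers.Dense(1)

  def call(self, inputs: tf.Tensor) -> Tuple[tf.Tensor, tf.Tensor]:
    x = self.common(inputs)
    return self.actor(x), self.critic(x)

# CartPole has 2 actions; 128 hidden units is an arbitrary illustrative choice.
model = ActorCritic(num_actions=2, num_hidden_units=128)
```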
261 | 261 | "id": "hk92njFziUJw"
|
262 | 262 | },
|
263 | 263 | "source": [
|
264 |
| - "## Training\n", |
| 264 | + "## Train the agent\n", |
265 | 265 | "\n",
|
266 | 266 | "To train the agent, you will follow these steps:\n",
|
267 | 267 | "\n",
|
268 | 268 | "1. Run the agent on the environment to collect training data per episode.\n",
|
269 | 269 | "2. Compute expected return at each time step.\n",
|
270 |
| - "3. Compute the loss for the combined actor-critic model.\n", |
| 270 | + "3. Compute the loss for the combined Actor-Critic model.\n", |
271 | 271 | "4. Compute gradients and update network parameters.\n",
|
272 | 272 | "5. Repeat 1-4 until either success criterion or max episodes has been reached.\n"
|
273 | 273 | ]
|
|
278 | 278 | "id": "R2nde2XDs8Gh"
|
279 | 279 | },
|
280 | 280 | "source": [
|
281 |
| - "### 1. Collecting training data\n", |
| 281 | + "### 1. Collect training data\n", |
282 | 282 | "\n",
|
283 | 283 | "As in supervised learning, in order to train the actor-critic model, you need\n",
|
284 | 284 | "to have training data. However, in order to collect such data, the model would\n",
|
|
299 | 299 | },
|
300 | 300 | "outputs": [],
|
301 | 301 | "source": [
|
302 |
| - "# Wrap OpenAI Gym's `env.step` call as an operation in a TensorFlow function.\n", |
| 302 | + "# Wrap Gym's `env.step` call as an operation in a TensorFlow function.\n", |
303 | 303 | "# This would allow it to be included in a callable TensorFlow graph.\n",
|
304 | 304 | "\n",
|
305 | 305 | "def env_step(action: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:\n",
|
|
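One way to complete this wrapper is sketched below, assuming `env` is the Gym `CartPole-v0` environment and that `env.step` returns the classic `(state, reward, done, info)` tuple.

```python
import gym
import numpy as np
import tensorflow as tf
from typing import List, Tuple

env = gym.make("CartPole-v0")

def env_step(action: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
  """Returns the next state, reward and done flag for a given action."""
  state, reward, done, info = env.step(action)
  return (state.astype(np.float32),
          np.array(reward, np.int32),
          np.array(done, np.int32))

def tf_env_step(action: tf.Tensor) -> List[tf.Tensor]:
  # `tf.numpy_function` wraps the eager/NumPy `env.step` call so it can be
  # used inside a callable TensorFlow graph (e.g. under `tf.function`).
  return tf.numpy_function(env_step, [action], [tf.float32, tf.int32, tf.int32])
```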
377 | 377 | "id": "lBnIHdz22dIx"
|
378 | 378 | },
|
379 | 379 | "source": [
|
380 |
| - "### 2. Computing expected returns\n", |
| 380 | + "### 2. Compute the expected returns\n", |
381 | 381 | "\n",
|
382 | 382 | "The sequence of rewards for each timestep $t$, $\\{r_{t}\\}^{T}_{t=1}$ collected during one episode is converted into a sequence of expected returns $\\{G_{t}\\}^{T}_{t=1}$ in which the sum of rewards is taken from the current timestep $t$ to $T$ and each reward is multiplied with an exponentially decaying discount factor $\\gamma$:\n",
|
383 | 383 | "\n",
|
|
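A sketch of this computation is shown below: the discounted sum is accumulated from the last timestep backwards, and an optional standardization step (zero mean, unit standard deviation) is included, which tends to stabilize training.

```python
import tensorflow as tf

eps = 1e-7  # small constant to avoid division by zero

def get_expected_return(rewards: tf.Tensor,
                        gamma: float,
                        standardize: bool = True) -> tf.Tensor:
  """Computes expected returns G_t by accumulating discounted rewards in reverse."""
  n = tf.shape(rewards)[0]
  returns = tf.TensorArray(dtype=tf.float32, size=n)

  # Walk through the rewards from the last timestep to the first.
  rewards = tf.cast(rewards[::-1], dtype=tf.float32)
  discounted_sum = tf.constant(0.0)
  for i in tf.range(n):
    discounted_sum = rewards[i] + gamma * discounted_sum
    returns = returns.write(i, discounted_sum)
  returns = returns.stack()[::-1]

  if standardize:
    # Standardizing the returns keeps their scale comparable across episodes.
    returns = (returns - tf.math.reduce_mean(returns)) / (
        tf.math.reduce_std(returns) + eps)

  return returns
```

For example, `get_expected_return(tf.constant([1.0, 1.0, 1.0]), gamma=0.99, standardize=False)` evaluates to approximately `[2.97, 1.99, 1.0]`.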
432 | 432 | "id": "qhr50_Czxazw"
|
433 | 433 | },
|
434 | 434 | "source": [
|
435 |
| - "### 3. The actor-critic loss\n", |
| 435 | + "### 3. The Actor-Critic loss\n", |
436 | 436 | "\n",
|
437 |
| - "Since a hybrid actor-critic model is used, the chosen loss function is a combination of actor and critic losses for training, as shown below:\n", |
| 437 | + "Since you're using a hybrid Actor-Critic model, the chosen loss function is a combination of Actor and Critic losses for training, as shown below:\n", |
438 | 438 | "\n",
|
439 | 439 | "$$L = L_{actor} + L_{critic}$$"
|
440 | 440 | ]
|
|
445 | 445 | "id": "nOQIJuG1xdTH"
|
446 | 446 | },
|
447 | 447 | "source": [
|
448 |
| - "#### Actor loss\n", |
| 448 | + "#### The Actor loss\n", |
449 | 449 | "\n",
|
450 |
| - "The actor loss is based on [policy gradients with the critic as a state dependent baseline](https://www.youtube.com/watch?v=EKqxumCuAAY&t=62m23s) and computed with single-sample (per-episode) estimates.\n", |
| 450 | + "The Actor loss is based on [policy gradients with the Critic as a state dependent baseline](https://www.youtube.com/watch?v=EKqxumCuAAY&t=62m23s) and computed with single-sample (per-episode) estimates.\n", |
451 | 451 | "\n",
|
452 | 452 | "$$L_{actor} = -\\sum^{T}_{t=1} \\log\\pi_{\\theta}(a_{t} | s_{t})[G(s_{t}, a_{t}) - V^{\\pi}_{\\theta}(s_{t})]$$\n",
|
453 | 453 | "\n",
|
454 | 454 | "where:\n",
|
455 | 455 | "- $T$: the number of timesteps per episode, which can vary per episode\n",
|
456 | 456 | "- $s_{t}$: the state at timestep $t$\n",
|
457 | 457 | "- $a_{t}$: chosen action at timestep $t$ given state $s$\n",
|
458 |
| - "- $\\pi_{\\theta}$: is the policy (actor) parameterized by $\\theta$\n", |
459 |
| - "- $V^{\\pi}_{\\theta}$: is the value function (critic) also parameterized by $\\theta$\n", |
| 458 | + "- $\\pi_{\\theta}$: is the policy (Actor) parameterized by $\\theta$\n", |
| 459 | + "- $V^{\\pi}_{\\theta}$: is the value function (Critic) also parameterized by $\\theta$\n", |
460 | 460 | "- $G = G_{t}$: the expected return for a given state, action pair at timestep $t$\n",
|
461 | 461 | "\n",
|
462 | 462 | "A negative term is added to the sum since the idea is to maximize the probabilities of actions yielding higher rewards by minimizing the combined loss.\n",
|
|
470 | 470 | "id": "Y304O4OAxiAv"
|
471 | 471 | },
|
472 | 472 | "source": [
|
473 |
| - "##### Advantage\n", |
| 473 | + "##### The Advantage\n", |
474 | 474 | "\n",
|
475 |
| - "The $G - V$ term in our $L_{actor}$ formulation is called the [advantage](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#advantage-functions), which indicates how much better an action is given a particular state over a random action selected according to the policy $\\pi$ for that state.\n", |
| 475 | + "The $G - V$ term in our $L_{actor}$ formulation is called the [Advantage](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#advantage-functions), which indicates how much better an action is given a particular state over a random action selected according to the policy $\\pi$ for that state.\n", |
476 | 476 | "\n",
|
477 | 477 | "While it's possible to exclude a baseline, this may result in high variance during training. And the nice thing about choosing the critic $V$ as a baseline is that it trained to be as close as possible to $G$, leading to a lower variance.\n",
|
478 | 478 | "\n",
|
479 |
| - "In addition, without the critic, the algorithm would try to increase probabilities for actions taken on a particular state based on expected return, which may not make much of a difference if the relative probabilities between actions remain the same.\n", |
| 479 | + "In addition, without the Critic, the algorithm would try to increase probabilities for actions taken on a particular state based on expected return, which may not make much of a difference if the relative probabilities between actions remain the same.\n", |
480 | 480 | "\n",
|
481 |
| - "For instance, suppose that two actions for a given state would yield the same expected return. Without the critic, the algorithm would try to raise the probability of these actions based on the objective $J$. With the critic, it may turn out that there's no advantage ($G - V = 0$) and thus no benefit gained in increasing the actions' probabilities and the algorithm would set the gradients to zero.\n", |
| 481 | + "For instance, suppose that two actions for a given state would yield the same expected return. Without the Critic, the algorithm would try to raise the probability of these actions based on the objective $J$. With the Critic, it may turn out that there's no Advantage ($G - V = 0$), and thus no benefit gained in increasing the actions' probabilities and the algorithm would set the gradients to zero.\n", |
482 | 482 | "\n",
|
483 | 483 | "<br>"
|
484 | 484 | ]
|
|
489 | 489 | "id": "1hrPLrgGxlvb"
|
490 | 490 | },
|
491 | 491 | "source": [
|
492 |
| - "#### Critic loss\n", |
| 492 | + "#### The Critic loss\n", |
493 | 493 | "\n",
|
494 | 494 | "Training $V$ to be as close possible to $G$ can be set up as a regression problem with the following loss function:\n",
|
495 | 495 | "\n",
|
|
512 | 512 | " action_probs: tf.Tensor, \n",
|
513 | 513 | " values: tf.Tensor, \n",
|
514 | 514 | " returns: tf.Tensor) -> tf.Tensor:\n",
|
515 |
| - " \"\"\"Computes the combined actor-critic loss.\"\"\"\n", |
| 515 | + " \"\"\"Computes the combined Actor-Critic loss.\"\"\"\n", |
516 | 516 | "\n",
|
517 | 517 | " advantage = returns - values\n",
|
518 | 518 | "\n",
|
|
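One way the rest of this function is commonly completed is sketched below. The Huber loss for the Critic term is an assumption here (any regression loss that pulls $V$ towards $G$ fits the setup above), and wrapping the advantage in `tf.stop_gradient` is another common variant.

```python
import tensorflow as tf

huber_loss = tf.keras.losses.Huber(reduction=tf.keras.losses.Reduction.SUM)

def compute_loss(action_probs: tf.Tensor,
                 values: tf.Tensor,
                 returns: tf.Tensor) -> tf.Tensor:
  """Computes the combined Actor-Critic loss."""

  advantage = returns - values

  # Actor loss: -sum_t log pi(a_t | s_t) * [G_t - V(s_t)]
  action_log_probs = tf.math.log(action_probs)
  actor_loss = -tf.math.reduce_sum(action_log_probs * advantage)

  # Critic loss: regress V towards the expected returns G.
  critic_loss = huber_loss(values, returns)

  return actor_loss + critic_loss
```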
530 | 530 | "id": "HSYkQOmRfV75"
|
531 | 531 | },
|
532 | 532 | "source": [
|
533 |
| - "### 4. Defining the training step to update parameters\n", |
| 533 | + "### 4. Define the training step to update parameters\n", |
534 | 534 | "\n",
|
535 | 535 | "All of the steps above are combined into a training step that is run every episode. All steps leading up to the loss function are executed with the `tf.GradientTape` context to enable automatic differentiation.\n",
|
536 | 536 | "\n",
|
|
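In outline, the `tf.GradientTape` pattern referenced above looks like the toy example below; the model, data, and loss here are stand-ins, not the tutorial's Actor-Critic objects.

```python
import tensorflow as tf

# Stand-in model and optimizer used only to illustrate the pattern.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

x = tf.random.normal((8, 4))
y = tf.zeros((8, 1))

with tf.GradientTape() as tape:
  # All operations leading up to the loss run inside the tape context,
  # so they are recorded for automatic differentiation.
  predictions = model(x)
  loss = tf.reduce_mean(tf.square(predictions - y))

# Differentiate the loss and update the network parameters.
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```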
567 | 567 | " action_probs, values, rewards = run_episode(\n",
|
568 | 568 | " initial_state, model, max_steps_per_episode) \n",
|
569 | 569 | "\n",
|
570 |
| - " # Calculate expected returns\n", |
| 570 | + " # Calculate the expected returns\n", |
571 | 571 | " returns = get_expected_return(rewards, gamma)\n",
|
572 | 572 | "\n",
|
573 | 573 | " # Convert training data to appropriate TF tensor shapes\n",
|
574 | 574 | " action_probs, values, returns = [\n",
|
575 | 575 | " tf.expand_dims(x, 1) for x in [action_probs, values, returns]] \n",
|
576 | 576 | "\n",
|
577 |
| - " # Calculating loss values to update our network\n", |
| 577 | + " # Calculate the loss values to update our network\n", |
578 | 578 | " loss = compute_loss(action_probs, values, returns)\n",
|
579 | 579 | "\n",
|
580 | 580 | " # Compute the gradients from the loss\n",
|
|
598 | 598 | "\n",
|
599 | 599 | "Training is executed by running the training step until either the success criterion or maximum number of episodes is reached. \n",
|
600 | 600 | "\n",
|
601 |
| - "A running record of episode rewards is kept in a queue. Once 100 trials are reached, the oldest reward is removed at the left (tail) end of the queue and the newest one is added at the head (right). A running sum of the rewards is also maintained for computational efficiency. \n", |
| 601 | + "A running record of episode rewards is kept in a queue. Once 100 trials are reached, the oldest reward is removed at the left (tail) end of the queue and the newest one is added at the head (right). A running sum of the rewards is also maintained for computational efficiency.\n", |
602 | 602 | "\n",
|
603 | 603 | "Depending on your runtime, training can finish in less than a minute."
|
604 | 604 | ]
|
|
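A small sketch of that queue-based bookkeeping, using Python's `collections.deque` (here the average is simply recomputed with `statistics.mean` instead of keeping an explicit running sum):

```python
import collections
import statistics

# Keep only the most recent 100 episode rewards; appending to a full deque
# automatically drops the oldest entry from the opposite end.
episodes_reward = collections.deque(maxlen=100)

for episode_reward in [10, 25, 200]:  # stand-in rewards for illustration
  episodes_reward.append(episode_reward)
  running_reward = statistics.mean(episodes_reward)
  print(running_reward)
```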
617 | 617 | "max_episodes = 10000\n",
|
618 | 618 | "max_steps_per_episode = 500\n",
|
619 | 619 | "\n",
|
620 |
| - "# Cartpole-v1 is considered solved if average reward is >= 475 over 500 \n", |
| 620 | + "# `CartPole-v1` is considered solved if average reward is >= 475 over 500 \n", |
621 | 621 | "# consecutive trials\n",
|
622 | 622 | "reward_threshold = 475\n",
|
623 | 623 | "running_reward = 0\n",
|
624 | 624 | "\n",
|
625 |
| - "# Discount factor for future rewards\n", |
| 625 | + "# The discount factor for future rewards\n", |
626 | 626 | "gamma = 0.99\n",
|
627 | 627 | "\n",
|
628 |
| - "# Keep last episodes reward\n", |
| 628 | + "# Keep the last episodes reward\n", |
629 | 629 | "episodes_reward: collections.deque = collections.deque(maxlen=min_episodes_criterion)\n",
|
630 | 630 | "\n",
|
631 | 631 | "t = tqdm.trange(max_episodes)\n",
|
|
642 | 642 | " t.set_postfix(\n",
|
643 | 643 | " episode_reward=episode_reward, running_reward=running_reward)\n",
|
644 | 644 | " \n",
|
645 |
| - " # Show average episode reward every 10 episodes\n", |
| 645 | + " # Show the average episode reward every 10 episodes\n", |
646 | 646 | " if i % 10 == 0:\n",
|
647 | 647 | " pass # print(f'Episode {i}: average reward: {avg_reward}')\n",
|
648 | 648 | " \n",
|
|
660 | 660 | "source": [
|
661 | 661 | "## Visualization\n",
|
662 | 662 | "\n",
|
663 |
| - "After training, it would be good to visualize how the model performs in the environment. You can run the cells below to generate a GIF animation of one episode run of the model. Note that additional packages need to be installed for OpenAI Gym to render the environment's images correctly in Colab." |
| 663 | + "After training, it would be good to visualize how the model performs in the environment. You can run the cells below to generate a GIF animation of one episode run of the model. Note that additional packages need to be installed for Gym to render the environment's images correctly in Colab." |
664 | 664 | ]
|
665 | 665 | },
|
666 | 666 | {
|
|
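A sketch of one way to produce such a GIF is shown below. It assumes a Gym version whose `render(mode='rgb_array')` call returns an RGB frame, that the `imageio` package is installed, and that `model` and `max_steps_per_episode` are the trained model and the constant defined earlier.

```python
import gym
import imageio
import numpy as np
import tensorflow as tf

render_env = gym.make("CartPole-v0")

def render_episode(env, model, max_steps: int):
  """Runs one greedy episode and collects rendered RGB frames."""
  state = tf.constant(env.reset(), dtype=tf.float32)
  frames = [env.render(mode='rgb_array')]
  for _ in range(max_steps):
    action_logits, _ = model(tf.expand_dims(state, 0))
    # argmax over the Actor's outputs picks the most probable action.
    action = int(np.argmax(np.squeeze(action_logits)))
    state, reward, done, _ = env.step(action)
    state = tf.constant(state, dtype=tf.float32)
    frames.append(env.render(mode='rgb_array'))
    if done:
      break
  return frames

frames = render_episode(render_env, model, max_steps_per_episode)
imageio.mimsave('cartpole-v0.gif', frames)  # one GIF frame per environment step
```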
731 | 731 | "source": [
|
732 | 732 | "## Next steps\n",
|
733 | 733 | "\n",
|
734 |
| - "This tutorial demonstrated how to implement the actor-critic method using Tensorflow.\n", |
| 734 | + "This tutorial demonstrated how to implement the Actor-Critic method using Tensorflow.\n", |
735 | 735 | "\n",
|
736 |
| - "As a next step, you could try training a model on a different environment in OpenAI Gym. \n", |
| 736 | + "As a next step, you could try training a model on a different environment in Gym. \n", |
737 | 737 | "\n",
|
738 |
| - "For additional information regarding actor-critic methods and the Cartpole-v0 problem, you may refer to the following resources:\n", |
| 738 | + "For additional information regarding Actor-Critic methods and the Cartpole-v0 problem, you may refer to the following resources:\n", |
739 | 739 | "\n",
|
740 |
| - "- [Actor Critic Method](https://hal.inria.fr/hal-00840470/document)\n", |
741 |
| - "- [Actor Critic Lecture (CAL)](https://www.youtube.com/watch?v=EKqxumCuAAY&list=PLkFD6_40KJIwhWJpGazJ9VSj9CFMkb79A&index=7&t=0s)\n", |
742 |
| - "- [Cartpole learning control problem \\[Barto, et al. 1983\\]](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf) \n", |
| 740 | + "- [The Actor-Critic method](https://hal.inria.fr/hal-00840470/document)\n", |
| 741 | + "- [The Actor-Critic lecture (CAL)](https://www.youtube.com/watch?v=EKqxumCuAAY&list=PLkFD6_40KJIwhWJpGazJ9VSj9CFMkb79A&index=7&t=0s)\n", |
| 742 | + "- [Cart Pole learning control problem \\[Barto, et al. 1983\\]](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf) \n", |
743 | 743 | "\n",
|
744 | 744 | "For more reinforcement learning examples in TensorFlow, you can check the following resources:\n",
|
745 | 745 | "- [Reinforcement learning code examples (keras.io)](https://keras.io/examples/rl/)\n",
|
|