# Lecture 22: Reinforcement Learning
**Instructors:** Vishal, Tejas
**Date:** June 13, 2025

## 📖 Topics Covered

- **1. Why Reinforcement Learning?**
  - Why is it hard to generate data for robots with frequently changing morphologies?
  - Why are traditional approaches (e.g., explicit physics models, hand-designed controllers) inefficient for skill learning?
  - How does RL (together with supervised learning) help bridge this gap?
  - What is an example where RL enabled fast adaptation (e.g., quadrupeds using rapid motor adaptation)?

- **2. RL Notation and Terminology**
  - What are stochastic processes and the Markov property?
  - What is a Markov Decision Process (MDP), and how is it defined?

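To make the MDP definition concrete, here is a minimal pure-Python sketch of a finite MDP as a tuple \( (S, A, P, R, \gamma) \). The two-state "chain" environment below is a made-up toy for illustration, not an example from the lecture:

```python
import random

# Toy finite MDP (S, A, P, R, gamma); a hypothetical two-state chain.
states = ["s0", "s1"]
actions = ["stay", "move"]
# P[s][a] -> list of (next_state, probability)
P = {
    "s0": {"stay": [("s0", 0.9), ("s1", 0.1)], "move": [("s1", 1.0)]},
    "s1": {"stay": [("s1", 1.0)], "move": [("s0", 1.0)]},
}
# R[s][a] -> immediate reward
R = {"s0": {"stay": 0.0, "move": 1.0}, "s1": {"stay": 0.5, "move": 0.0}}
gamma = 0.9

def sample_trajectory(policy, s0="s0", horizon=5, seed=0):
    """Roll out a trajectory. The next state depends only on the current
    (state, action) pair -- exactly the Markov property."""
    rng = random.Random(seed)
    s, traj = s0, []
    for _ in range(horizon):
        a = policy(s)
        next_states, probs = zip(*P[s][a])
        s_next = rng.choices(next_states, weights=probs)[0]
        traj.append((s, a, R[s][a]))
        s = s_next
    return traj

traj = sample_trajectory(lambda s: "move", horizon=3)
```

Under the always-"move" policy the chain deterministically alternates s0 → s1 → s0, so the sampled trajectory is easy to check by hand.
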
- **3. Anatomy of the Reinforcement Learning Pipeline**
  - How do we collect samples from the environment using the current policy?
  - What does model fitting or sample evaluation involve?
  - How is the policy improved based on that evaluation?
  - How do modern simulators and sim-to-real transfer help overcome sample-collection bottlenecks?

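The three-step loop above (collect samples → evaluate → improve) can be sketched in a few lines. Everything here is hypothetical: a made-up one-step "bandit" environment, and simple hill climbing standing in for a real policy-gradient or model-based improvement step:

```python
import random

rng = random.Random(0)

def collect_samples(policy_param, n=200):
    """Step 1: roll out the current policy. Toy one-step env: action 1
    yields reward 1, action 0 yields reward 0."""
    samples = []
    for _ in range(n):
        action = 1 if rng.random() < policy_param else 0
        samples.append((action, float(action)))
    return samples

def evaluate(samples):
    """Step 2: estimate expected return from the collected samples."""
    return sum(r for _, r in samples) / len(samples)

def improve(policy_param, current_return, step=0.1):
    """Step 3: propose a perturbed policy and keep it only if its estimated
    return is at least as good (hill climbing as a stand-in for a gradient step)."""
    candidate = min(1.0, policy_param + step)
    if evaluate(collect_samples(candidate)) >= current_return:
        return candidate
    return policy_param

param = 0.2
for _ in range(5):
    batch = collect_samples(param)
    param = improve(param, evaluate(batch))
```

Real pipelines replace each step with something heavier (simulators for step 1, value estimation for step 2, gradient updates for step 3), but the loop structure is the same.
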
- **4. Policy Gradient Methods**

  **4.1 Goal of RL**
  - What is the objective function \( J(\theta) \) in RL?
  - How does the formulation differ between finite- and infinite-horizon settings?
  - Why is the goal to maximize expected return?

  **4.2 Policy Gradient**
  - How do we compute the gradient of the objective function?
  - What is the REINFORCE trick, and how does the resulting algorithm work?

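The objective is \( J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t r(s_t, a_t)\right] \), and the REINFORCE (score-function) trick turns its gradient into something we can estimate from samples: \( \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, R(\tau)\right] \). A minimal pure-Python sketch of this estimator, assuming a made-up one-step bandit with a single-parameter sigmoid (Bernoulli) policy rather than the CartPole/PyTorch setup from the assignment:

```python
import math
import random

rng = random.Random(0)

def pi(theta):
    """Bernoulli policy: probability of taking action 1 is sigmoid(theta)."""
    return 1.0 / (1.0 + math.exp(-theta))

def grad_log_pi(theta, a):
    """d/dtheta of log pi(a; theta) for the sigmoid Bernoulli policy: a - sigmoid(theta)."""
    return a - pi(theta)

# REINFORCE: theta <- theta + lr * mean_i[ grad_log_pi(a_i) * R_i ],
# on a toy one-step bandit: action 1 earns reward 1, action 0 earns 0.
theta, lr = 0.0, 0.5
for _ in range(200):
    grads = []
    for _ in range(32):
        a = 1 if rng.random() < pi(theta) else 0
        reward = float(a)                       # return of this one-step "trajectory"
        grads.append(grad_log_pi(theta, a) * reward)
    theta += lr * sum(grads) / len(grads)       # Monte Carlo gradient ascent
```

Since action 1 is always better, gradient ascent should push \( \pi_\theta(a{=}1) \) toward 1; no derivative of the reward itself is ever needed, which is the whole point of the trick.
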
- **5. Reducing Variance in REINFORCE**
  - Why does REINFORCE have high variance despite being unbiased?
  - How does the reward-to-go trick exploit causality to reduce variance?
  - What are baseline methods for variance reduction?
  - How do we choose an optimal baseline to minimize variance?
  - What are actor-critic methods, and how do they combine value estimation with policy updates?

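The two variance-reduction ideas above are easy to state in code. Reward-to-go weights each action only by rewards that come *after* it (causality), and subtracting a constant baseline leaves the gradient estimator unbiased while shrinking its variance. A small sketch (the mean reward-to-go is used as a simple illustrative baseline; a learned value function, as in actor-critic, is the usual choice):

```python
def rewards_to_go(rewards, gamma=1.0):
    """Reward-to-go: G_t = sum over t' >= t of gamma^(t'-t) * r_{t'}.
    Computed in one backward pass over the trajectory."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def advantages(rewards, gamma=1.0):
    """Subtract a constant baseline (here: the mean reward-to-go).
    This keeps the policy-gradient estimator unbiased but reduces variance."""
    rtg = rewards_to_go(rewards, gamma)
    baseline = sum(rtg) / len(rtg)
    return [g - baseline for g in rtg]
```

For example, `rewards_to_go([1, 2, 3])` gives `[6, 5, 3]`: the first action is credited with all three rewards, the last only with its own.
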
- **6. Value-Based Methods**
  - What are the value function and the Q-function?
  - What are SARSA and Q-learning?
  - How does Deep Q-Learning extend traditional Q-learning?

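The tabular updates behind these methods fit in a few lines. Q-learning is off-policy (it bootstraps from the *best* next action), while SARSA is on-policy (it bootstraps from the action *actually taken* next); Deep Q-Learning keeps the same update but replaces the table with a neural network. A minimal sketch with hypothetical states/actions:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9, actions=(0, 1)):
    """Off-policy TD update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy TD update: bootstrap from the next action the policy actually chose."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)
# One step of experience in a toy MDP: from state 0, action 1 earned reward 1.
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```

Starting from an all-zero table, this single update moves `Q[(0, 1)]` halfway toward the target of 1.0, i.e., to 0.5.
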

## 📄 Assignment

- 🧠 **Policy Gradient & Actor-Critic Walkthrough:**
  Open the following Colab notebook to implement and experiment with Policy Gradient methods from scratch:
  [Policy Gradient Colab Notebook](https://colab.research.google.com/drive/1TWPHz3udlKqsdSyMvTiZG9Y5P7VrY3gH?usp=sharing)

  This walkthrough is designed to help you implement a working **Policy Gradient agent** using PyTorch on environments like *CartPole*.

  ---

  **📚 What You'll Learn**
  - Core ideas behind Policy Gradient algorithms
  - How to implement and train a neural-network policy
  - How to collect rollouts and compute returns
  - How to update the policy using gradient ascent
  - (Optional) Baseline methods & Generalized Advantage Estimation (GAE)

  **🛠 Prerequisites**
  - Python + PyTorch basics
  - Key RL concepts: policy, reward, return, advantage, value function

  **🗂 Notebook Structure**
  - **Environment Setup**: Logging and configuration
  - **Policy Network**: Implementation and action sampling
  - **Training Loop**: Computing returns and updating the policy
  - **Variance Reduction (Optional)**: Baselines and GAE for stability

  **👨‍🏫 Tips for Students**
  - Run the cells in order; don't skip any!
  - Print observations, actions, and rewards to debug.
  - Try different hyperparameters and Gym environments.
  - Use TensorBoard or video logs to visualize progress.

  > 📘 Inspired by [CS285: Deep RL (Berkeley)](https://rail.eecs.berkeley.edu/deeprlcourse/)

  _Courtesy: Tejas_

📢 Post your doubts in the `#module-7-robot-learning` Slack channel!


## 🔗 Resources

| 📚 Topic | 🔗 Link |
|----------|---------|
| 📑 Lecture Slides – Reinforcement Learning | See Lectures 4–7 of the RAIL course (linked below) |
| 🎓 Deep Reinforcement Learning – Sergey Levine (RAIL, Berkeley) | [Course Website](https://rail.eecs.berkeley.edu/deeprlcourse/) |
| 🧠 Policy Gradient Algorithms – Lilian Weng | [Blog Post](https://lilianweng.github.io/posts/2018-04-08-policy-gradient/) |
| ⚙️ PPO Implementation Details – ICLR Blog Track | [Blog Post](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/) |
| 📘 Mathematical Foundations of RL – Shiyu Zhao (Westlake University) | [GitHub](https://github.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning) |
| ⚡ RL Quickstart Guide – Joseph Suarez (PufferLib Creator) | [X Thread](https://x.com/jsuarez5341/status/1854855861295849793) |
| 📦 Stable Baselines3 – RL Library (DLR-RM) | [GitHub](https://github.com/DLR-RM/stable-baselines3) |
| 🧼 CleanRL – Minimal RL Implementations | [GitHub](https://github.com/vwxyzjn/cleanrl) |
| 🐉 Decisions & Dragons – FAQs About RL | [Website](https://www.decisionsanddragons.com/) |

---