|
15 | 15 | "\n", |
16 | 16 | "This notebook is more an \"example of what works\" rather than a deep dive tutorial.\n", |
17 | 17 | "\n", |
18 | | - "See stable-baselines3.readthedocs.io/ for a more detailed information.\n", |
| 18 | + "See https://docs.ray.io/en/latest/rllib/rllib-env.html#configuring-environments for a more detailed information.\n", |
19 | 19 | "\n", |
20 | | - "This notebook is tested with grid2op 1.10.2 and stable baselines3 version 2.3.2 on an ubuntu 20.04 machine.\n", |
| 20 | + "See also https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html for other details\n", |
21 | 21 | "\n", |
| 22 | + "This notebook is tested with grid2op 1.10.2 and ray 2.9 on an ubuntu 20.04 machine.\n", |
22 | 23 | "\n", |
| 24 | + "- [0 Some tips to get started](#0-some-tips-to-get-started) : is a reminder on what you can do to make things work. Indeed, this notebook explains \"how to use grid2op with stable baselines\" but not \"how to create a working agent able to operate a real powergrid in real time with stable baselines\". We wish we could explain the later...\n", |
| 25 | + "- [1 Create the \"Grid2opEnvWrapper\" class](#1-create-the-grid2openvwraper-class) : explain how to create the main grid2op env class that you can use a \"gymnasium\" environment. \n", |
| 26 | + "- [2 Create an environment, and train a first policy](#2-create-an-environment-and-train-a-first-policy): show how to create an environment from the class above (is pretty easy)\n", |
| 27 | + "- [3 Evaluate the trained agent ](#3-evaluate-the-trained-agent): show how to evaluate the trained \"agent\"\n", |
| 28 | + "- [4 Some customizations](#4-some-customizations): explain how to perform some customization of your agent / environment / policy\n", |
23 | 29 | "## 0 Some tips to get started\n", |
24 | 30 | "\n", |
25 | 31 | "<font color='red'> It is unlikely that \"simply\" using a RL algorithm on a grid2op environment will lead to good results for the vast majority of environments.</font>\n", |
|
62 | 68 | "metadata": {}, |
63 | 69 | "source": [ |
64 | 70 | "\n", |
65 | | - "## 1 Create the \"Grid2opEnv\" class\n", |
| 71 | + "## 1 Create the \"Grid2opEnvWrapper\" class\n", |
66 | 72 | "\n", |
67 | 73 | "In the next cell, we define a custom environment (that will internally use the `GymEnv` grid2op class). It is not strictly needed\n", |
68 | 74 | "\n", |
|
102 | 108 | "source": [ |
103 | 109 | "from gymnasium import Env\n", |
104 | 110 | "from gymnasium.spaces import Discrete, MultiDiscrete, Box\n", |
| 111 | + "import json\n", |
105 | 112 | "\n", |
106 | 113 | "import ray\n", |
107 | 114 | "from ray.rllib.algorithms.ppo import PPOConfig\n", |
108 | 115 | "from ray.rllib.algorithms import ppo\n", |
109 | 116 | "\n", |
110 | 117 | "from typing import Dict, Literal, Any\n", |
| 118 | + "import copy\n", |
111 | 119 | "\n", |
112 | 120 | "import grid2op\n", |
113 | 121 | "from grid2op.gym_compat import GymEnv, BoxGymObsSpace, DiscreteActSpace, BoxGymActSpace, MultiDiscreteActSpace\n", |
|
201 | 209 | " else:\n", |
202 | 210 | " raise NotImplementedError(f\"action type '{act_type}' is not currently supported.\")\n", |
203 | 211 | " \n", |
204 | | - " \n", |
205 | | - " def reset(self, seed, options):\n", |
| 212 | + " def reset(self, seed=None, options=None):\n", |
206 | 213 | " # use default _gym_env (from grid2op.gym_compat module)\n", |
| 214 | + " # NB: here you can also specify \"default options\" when you reset, for example:\n", |
| 215 | + " # - limiting the duration of the episode \"max step\"\n", |
| 216 | + " # - starting at different steps \"init ts\"\n", |
| 217 | + " # - study difficult scenario \"time serie id\"\n", |
| 218 | + " # - specify an initial state of your grid \"init state\"\n", |
207 | 219 | " return self._gym_env.reset(seed=seed, options=options)\n", |
208 | 220 | " \n", |
209 | 221 | " def step(self, action):\n", |
|
216 | 228 | "cell_type": "markdown", |
217 | 229 | "metadata": {}, |
218 | 230 | "source": [ |
219 | | - "Now we init ray, because we need to." |
| 231 | + "## 2 Create an environment, and train a first policy" |
220 | 232 | ] |
221 | 233 | }, |
222 | 234 | { |
223 | | - "cell_type": "code", |
224 | | - "execution_count": null, |
| 235 | + "cell_type": "markdown", |
225 | 236 | "metadata": {}, |
226 | | - "outputs": [], |
227 | 237 | "source": [ |
228 | | - "ray.init()" |
| 238 | + "Now we init ray, because we need to." |
229 | 239 | ] |
230 | 240 | }, |
231 | 241 | { |
232 | | - "cell_type": "markdown", |
| 242 | + "cell_type": "code", |
| 243 | + "execution_count": null, |
233 | 244 | "metadata": {}, |
| 245 | + "outputs": [], |
234 | 246 | "source": [ |
235 | | - "## 2 Make a default environment, and train a PPO agent for one iteration" |
| 247 | + "ray.init()" |
236 | 248 | ] |
237 | 249 | }, |
238 | 250 | { |
|
279 | 291 | "cell_type": "markdown", |
280 | 292 | "metadata": {}, |
281 | 293 | "source": [ |
282 | | - "## 3 Train a PPO agent using 2 \"runners\" to make the rollouts\n", |
| 294 | + "## 3 Evaluate the trained agent\n", |
| 295 | + "\n", |
| 296 | + "This notebook is a simple quick introduction for stable baselines only. So we don't really recall everything that has been said previously.\n", |
| 297 | + "\n", |
| 298 | + "Please consult the section `0) Recommended initial steps` of the notebook [11_IntegrationWithExistingRLFrameworks](./11_IntegrationWithExistingRLFrameworks.ipynb) for more information.\n", |
| 299 | + "\n", |
| 300 | + "**TLD;DR** grid2op offers the possibility to test your agent on scenarios / episodes different from the one it has been trained. We greatly encourage you to use this functionality.\n", |
| 301 | + "\n", |
| 302 | + "There are two main ways to evaluate your agent:\n", |
| 303 | + "\n", |
| 304 | + "- you stay in the \"gymnasium\" world (see [here](#31-staying-in-the-gymnasium-ecosystem) ) and you evaluate your policy directly just like you would any other gymnasium compatible environment. Simple, easy but without support for some grid2op features\n", |
| 305 | + "- you \"get back\" to the \"grid2op\" world (detailed [here](#32-using-the-grid2op-ecosystem)) by \"converting\" your NN policy into something that is able to output grid2op like action. This introduces yet again a \"wrapper\" but you can benefit from all grid2op features, such as the `Runner` to save an inspect what your policy has done.\n", |
| 306 | + "\n", |
| 307 | + "<font color='red'> We show here just a simple examples to \"get easily started\". For much better working agents, you can have a look at l2rpn-baselines code. There you have classes that maps the environment, the agents etc. to grid2op directly (you don't have to copy paste any wrapper).</font> \n", |
| 308 | + "\n", |
| 309 | + "\n", |
| 310 | + "\n", |
| 311 | + "### 3.1 staying in the gymnasium ecosystem\n", |
| 312 | + "\n", |
| 313 | + "You can do pretty much what you want, but you have to do it yourself, or use any of the \"Wrappers\" available in gymnasium https://gymnasium.farama.org/main/api/wrappers/ (*eg* https://gymnasium.farama.org/main/api/wrappers/misc_wrappers/#gymnasium.wrappers.RecordEpisodeStatistics) or in your RL framework.\n", |
| 314 | + "\n", |
| 315 | + "For the sake of simplicity, we show how to do things \"manually\" even though we do not recommend to do it like that." |
| 316 | + ] |
| 317 | + }, |
| 318 | + { |
| 319 | + "cell_type": "code", |
| 320 | + "execution_count": null, |
| 321 | + "metadata": {}, |
| 322 | + "outputs": [], |
| 323 | + "source": [] |
| 324 | + }, |
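| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "Below is a minimal sketch of such a \"manual\" evaluation loop. It assumes the `Algorithm` trained in section 2 is available in a variable called `rllib_algo` (adapt this name to whatever you used), that the `Grid2opEnvWrapper` accepts an (empty) configuration dict just like the rllib configs pass it, and it queries the policy with `Algorithm.compute_single_action`.\n" |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# minimal sketch: evaluate the trained policy on a few episodes, staying in gymnasium\n", |
| | + "# (assumes `rllib_algo` is the Algorithm trained in section 2)\n", |
| | + "nb_episode_test = 2\n", |
| | + "gym_env = Grid2opEnvWrapper({})  # empty config: same defaults as used for training\n", |
| | + "ep_infos = {}\n", |
| | + "for ep_id in range(nb_episode_test):\n", |
| | + "    obs, info = gym_env.reset(seed=ep_id)\n", |
| | + "    done, cum_reward, nb_step = False, 0., 0\n", |
| | + "    while not done:\n", |
| | + "        act = rllib_algo.compute_single_action(obs, explore=False)\n", |
| | + "        obs, reward, terminated, truncated, info = gym_env.step(act)\n", |
| | + "        cum_reward += float(reward)\n", |
| | + "        nb_step += 1\n", |
| | + "        done = terminated or truncated\n", |
| | + "    ep_infos[ep_id] = {\"steps survived\": nb_step, \"total reward\": cum_reward}\n", |
| | + "ep_infos" |
| | + ] |
| | + }, |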
| 325 | + { |
| 326 | + "cell_type": "markdown", |
| 327 | + "metadata": {}, |
| 328 | + "source": [ |
| 329 | + "### 3.2 using the grid2op environment" |
| 330 | + ] |
| 331 | + }, |
| 332 | + { |
| 333 | + "cell_type": "code", |
| 334 | + "execution_count": null, |
| 335 | + "metadata": {}, |
| 336 | + "outputs": [], |
| 337 | + "source": [] |
| 338 | + }, |
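| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "Below is a minimal sketch of such a conversion: a grid2op `BaseAgent` that queries the trained rllib `Algorithm` (again assumed to be stored in a variable called `rllib_algo`) and translates observations / actions with the gym spaces of the `Grid2opEnvWrapper`. It assumes the underlying grid2op environment is reachable through the `init_env` attribute of grid2op's `GymEnv`. Such an agent can then be used with the grid2op `Runner`.\n" |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "from grid2op.Agent import BaseAgent\n", |
| | + "from grid2op.Runner import Runner\n", |
| | + "\n", |
| | + "\n", |
| | + "class RLlibToGrid2opAgent(BaseAgent):\n", |
| | + "    \"\"\"Minimal sketch: wraps the trained rllib Algorithm into a grid2op agent.\"\"\"\n", |
| | + "    def __init__(self, gym_env_wrapper, trained_algo):\n", |
| | + "        # gym_env_wrapper: a Grid2opEnvWrapper (gives access to the gym spaces)\n", |
| | + "        # trained_algo: the rllib Algorithm trained in section 2\n", |
| | + "        super().__init__(gym_env_wrapper._gym_env.init_env.action_space)\n", |
| | + "        self._gym_env = gym_env_wrapper._gym_env\n", |
| | + "        self._trained_algo = trained_algo\n", |
| | + "\n", |
| | + "    def act(self, obs, reward, done=False):\n", |
| | + "        # grid2op observation -> gym observation -> NN action -> grid2op action\n", |
| | + "        gym_obs = self._gym_env.observation_space.to_gym(obs)\n", |
| | + "        gym_act = self._trained_algo.compute_single_action(gym_obs, explore=False)\n", |
| | + "        return self._gym_env.action_space.from_gym(gym_act)\n", |
| | + "\n", |
| | + "\n", |
| | + "gym_env = Grid2opEnvWrapper({})  # empty config: same defaults as used for training\n", |
| | + "my_agent = RLlibToGrid2opAgent(gym_env, rllib_algo)\n", |
| | + "\n", |
| | + "# the grid2op Runner can now run (and optionally store) complete episodes with this agent\n", |
| | + "runner = Runner(**gym_env._gym_env.init_env.get_params_for_runner(),\n", |
| | + "                agentClass=None, agentInstance=my_agent)\n", |
| | + "res = runner.run(nb_episode=2)\n", |
| | + "res" |
| | + ] |
| | + }, |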
| 339 | + { |
| 340 | + "cell_type": "markdown", |
| 341 | + "metadata": {}, |
| 342 | + "source": [ |
| 343 | + "## 4 some customizations\n", |
| 344 | + "\n", |
| 345 | + "### 4.1 Train a PPO agent using 2 \"runners\" to make the rollouts\n", |
283 | 346 | "\n", |
284 | 347 | "In this second example, we explain briefly how to train the model using 2 \"processes\". This is, the agent will interact with 2 agents at the same time during the \"rollout\" phases.\n", |
285 | 348 | "\n", |
|
296 | 359 | "\n", |
297 | 360 | "# use multiple runners\n", |
298 | 361 | "config2 = (PPOConfig().training(gamma=0.9, lr=0.01)\n", |
299 | | - " .environment(env=Grid2opEnv, env_config={})\n", |
| 362 | + " .environment(env=Grid2opEnvWrapper, env_config={})\n", |
300 | 363 | " .resources(num_gpus=0)\n", |
301 | 364 | " .env_runners(num_env_runners=2, num_envs_per_env_runner=1, num_cpus_per_env_runner=1)\n", |
302 | 365 | " .framework(\"tf2\")\n", |
|
326 | 389 | "cell_type": "markdown", |
327 | 390 | "metadata": {}, |
328 | 391 | "source": [ |
329 | | - "## 4 Use non default parameters to make the grid2op environment\n", |
| 392 | + "### 4.2 Use non default parameters to make the grid2op environment\n", |
330 | 393 | "\n", |
331 | 394 | "In this third example, we will train a policy using the \"box\" action space, and on another environment (`l2rpn_idf_2023` instead of `l2rpn_case14_sandbox`)" |
332 | 395 | ] |
|
345 | 408 | " \"act_type\": \"box\",\n", |
346 | 409 | " }\n", |
347 | 410 | "config3 = (PPOConfig().training(gamma=0.9, lr=0.01)\n", |
348 | | - " .environment(env=Grid2opEnv, env_config=env_config)\n", |
| 411 | + " .environment(env=Grid2opEnvWrapper, env_config=env_config)\n", |
349 | 412 | " .resources(num_gpus=0)\n", |
350 | 413 | " .env_runners(num_env_runners=2, num_envs_per_env_runner=1, num_cpus_per_env_runner=1)\n", |
351 | 414 | " .framework(\"tf2\")\n", |
|
392 | 455 | " \"act_type\": \"multi_discrete\",\n", |
393 | 456 | " }\n", |
394 | 457 | "config4 = (PPOConfig().training(gamma=0.9, lr=0.01)\n", |
395 | | - " .environment(env=Grid2opEnv, env_config=env_config4)\n", |
| 458 | + " .environment(env=Grid2opEnvWrapper, env_config=env_config4)\n", |
396 | 459 | " .resources(num_gpus=0)\n", |
397 | 460 | " .env_runners(num_env_runners=2, num_envs_per_env_runner=1, num_cpus_per_env_runner=1)\n", |
398 | 461 | " .framework(\"tf2\")\n", |
|
422 | 485 | "cell_type": "markdown", |
423 | 486 | "metadata": {}, |
424 | 487 | "source": [ |
425 | | - "## 5 Customize the policy (number of layers, size of layers etc.)\n", |
| 488 | + "### 4.3 Customize the policy (number of layers, size of layers etc.)\n", |
426 | 489 | "\n", |
427 | 490 | "This notebook does not aim at covering all possibilities offered by ray / rllib. For that you need to refer to the ray / rllib documentation.\n", |
428 | 491 | "\n", |
|
439 | 502 | "\n", |
440 | 503 | "# Use a \"Box\" action space (mainly to use redispatching, curtailment and storage units)\n", |
441 | 504 | "config5 = (PPOConfig().training(gamma=0.9, lr=0.01)\n", |
442 | | - " .environment(env=Grid2opEnv, env_config={})\n", |
| 505 | + " .environment(env=Grid2opEnvWrapper, env_config={})\n", |
443 | 506 | " .resources(num_gpus=0)\n", |
444 | 507 | " .env_runners(num_env_runners=2, num_envs_per_env_runner=1, num_cpus_per_env_runner=1)\n", |
445 | 508 | " .framework(\"tf2\")\n", |
|