|
15 | 15 | "\n", |
16 | 16 | "This notebook is more an \"example of what works\" rather than a deep dive tutorial.\n", |
17 | 17 | "\n", |
18 | | - "See https://docs.ray.io/en/latest/rllib/rllib-env.html#configuring-environments for a more detailed information.\n", |
19 | | - "\n", |
20 | | - "See also https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html for other details\n", |
21 | | - "\n", |
22 | | - "This notebook is tested with grid2op 1.10 and ray 2.23 on an ubuntu 20.04 machine.\n", |
23 | | - "\n", |
| 18 | + "See stable-baselines3.readthedocs.io/ for a more detailed information.\n", |
| 19 | + "\n", |
| 20 | + "This notebook is tested with grid2op 1.10.2 and stable baselines3 version 2.3.2 on an ubuntu 20.04 machine.\n", |
| 21 | + "\n", |
| 22 | + "\n", |
| 23 | + "## 0 Some tips to get started\n", |
| 24 | + "\n", |
| 25 | + "<font color='red'> It is unlikely that \"simply\" using a RL algorithm on a grid2op environment will lead to good results for the vast majority of environments.</font>\n", |
| 26 | + "\n", |
| 27 | + "To make RL algorithms work with more or less sucess you might want to:\n", |
| 28 | + "\n", |
| 29 | + "    1) adjust the observation space: in particular selecting the right information for your agent. Too much information\n", |
| 30 | + " and the size of the observation space will blow up and your agent will not learn anything. Not enough\n", |
| 31 | + " information and your agent will not be able to capture anything.\n", |
| 32 | + " \n", |
| 33 | + "    2) customize the action space: dealing with both discrete and continuous values is often a challenge. So maybe you\n", |
| 34 | + "    want to focus on only one type of action. And in all cases, try to reduce the number of actions your agent\n", |
| 35 | + "    can perform. Indeed, for \"larger\" grids (118 substations, as a reference the french grid counts more than 6,000\n", |
| 36 | + "    such substations...) and even when limiting each substation to 2 busbars (as a reference, some substations have\n", |
| 37 | + "    more than 12 such \"busbars\"), your agent will have to choose between more than 60,000 different discrete actions\n", |
| 38 | + "    at each step. This is way too large for current RL algorithms as far as we know (and the proposed environments\n", |
| 39 | + "    are small in comparison to real ones).\n", |
| 40 | + " \n", |
| 41 | + "    3) customize the reward: the default reward might not work great for you. Ultimately, what TSO's or ISO's want is\n", |
| 42 | + "    to operate the grid safely, as long as possible and with a cost as low as possible. It is of course really hard to\n", |
| 43 | + "    capture all of this in one single reward signal. Customizing the reward is also really important because the \"do\n", |
| 44 | + "    nothing\" policy often leads to really good results (much better than random actions), which makes exploration difficult\n", |
| 45 | + "    (your agent has little incentive to try different actions...). So you kind of want to incentivize your agent to perform some actions at some point (a minimal sketch of these customizations is given right after this section).\n", |
| 46 | + " \n", |
| 47 | + "    4) use a fast simulator: even if you target an industrial application with industry grade simulators, we would still\n", |
| 48 | + "    advise you to use (at least at the early stages of training) a fast simulator for the vast majority of the training\n", |
| 49 | + "    process and then maybe fine tune on a better one.\n", |
| 50 | + " \n", |
| 51 | + "    5) combine RL with some heuristics: it's super easy to implement things like \"if there is no issue, then do\n", |
| 52 | + "    nothing\" as a rule, whereas learning it can be quite time consuming for an RL agent. Don't hesitate to check out\n", |
| 53 | + "    the \"l2rpn-baselines\" repository for heuristics that already \"kind of work\".\n", |
| 54 | + " \n", |
| 55 | + "And finally don't hesitate to check solution proposed by winners of past l2rpn competitions in l2rpn-baselines.\n", |
| 56 | + "\n", |
| 57 | + "You can also ask question on our discord or on our github." |
| 58 | + ] |
| 59 | + }, |
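To make tips 1) to 4) a bit more concrete, here is a minimal sketch (not taken from this notebook, and only one possible choice of attributes and reward class) of how the observation space, the action space, the reward and the backend can be customized with grid2op's `gym_compat` module:

```python
import grid2op
from grid2op.gym_compat import GymEnv, BoxGymObsSpace, DiscreteActSpace
from grid2op.Reward import LinesCapacityReward  # example reward (tip 3)
from lightsim2grid import LightSimBackend       # fast simulator (tip 4)

# create the grid2op environment with a fast backend and a custom reward
g2op_env = grid2op.make("l2rpn_case14_sandbox",
                        backend=LightSimBackend(),
                        reward_class=LinesCapacityReward)

# wrap it as a gymnasium environment
gym_env = GymEnv(g2op_env)

# tip 1: keep only a handful of observation attributes
gym_env.observation_space = BoxGymObsSpace(g2op_env.observation_space,
                                           attr_to_keep=["rho", "gen_p", "load_p"])

# tip 2: keep only discrete topology actions
gym_env.action_space = DiscreteActSpace(g2op_env.action_space,
                                        attr_to_keep=["set_bus"])
```

The `Grid2opEnvWrapper` defined later in this notebook performs this kind of customization internally, driven by its `env_config` dictionary.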
| 60 | + { |
| 61 | + "cell_type": "markdown", |
| 62 | + "metadata": {}, |
| 63 | + "source": [ |
24 | 64 | "\n", |
25 | 65 | "## 1 Create the \"Grid2opEnv\" class\n", |
26 | 66 | "\n", |
27 | | - "In the next cell, we define a custom environment (that will internally use the `GymEnv` grid2op class) that is needed for ray / rllib.\n", |
| 67 | + "In the next cell, we define a custom environment (that will internally use the `GymEnv` grid2op class). It is not strictly needed\n", |
28 | 68 | "\n", |
29 | 69 | "Indeed, in order to work with ray / rllib you need to define a custom wrapper on top of the GymEnv wrapper. You then have:\n", |
30 | 70 | "\n", |
31 | 71 | "- self._g2op_env which is the default grid2op environment, receiving grid2op Action and producing grid2op Observation.\n", |
32 | 72 | "- self._gym_env which is the grid2op defined `gymnasium Environment` that cannot be directly used with ray / rllib\n", |
33 | | - "- `Grid2opEnv` which is a the wrapper on top of `self._gym_env` to make it usable with ray / rllib.\n", |
| 73 | + "- `Grid2opEnvWrapper` which is a the wrapper on top of `self._gym_env` to make it usable with ray / rllib.\n", |
34 | 74 | "\n", |
35 | | - "Ray / rllib expects the gymnasium environment to inherit from `gymnasium.Env` and to be initialized with a given configuration. This is why you need to create the `Grid2opEnv` wrapper on top of `GymEnv`.\n", |
| 75 | + "Ray / rllib expects the gymnasium environment to inherit from `gymnasium.Env` and to be initialized with a given configuration. This is why you need to create the `Grid2opEnvWrapper` wrapper on top of `GymEnv`.\n", |
36 | 76 | "\n", |
37 | | - "In the initialization of `Grid2opEnv`, the `env_config` variable is a dictionary that can take as key-word arguments:\n", |
| 77 | + "In the initialization of `Grid2opEnvWrapper`, the `env_config` variable is a dictionary that can take as key-word arguments:\n", |
38 | 78 | "\n", |
39 | 79 | "- `backend_cls` : what is the class of the backend. If not provided, it will use `LightSimBackend` from the `lightsim2grid` package\n", |
40 | 80 | "- `backend_options`: what options will be used to create the backend for your environment. Your backend will be created by calling\n", |
|
74 | 114 | "from lightsim2grid import LightSimBackend\n", |
75 | 115 | "\n", |
76 | 116 | "\n", |
77 | | - "class Grid2opEnv(Env):\n", |
| 117 | + "class Grid2opEnvWrapper(Env):\n", |
78 | 118 | " def __init__(self,\n", |
79 | 119 | " env_config: Dict[Literal[\"backend_cls\",\n", |
80 | 120 | " \"backend_options\",\n", |
|
83 | 123 | " \"obs_attr_to_keep\",\n", |
84 | 124 | " \"act_type\",\n", |
85 | 125 | " \"act_attr_to_keep\"],\n", |
86 | | - " Any]):\n", |
| 126 | + "                 Any] = None):\n", |
87 | 127 | " super().__init__()\n", |
88 | 128 | " if env_config is None:\n", |
89 | 129 | " env_config = {}\n", |
|
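For illustration only (the class body is truncated in this view, so the values shown for each key are assumptions rather than the notebook's own defaults), a wrapped environment could then be created with a custom configuration along these lines:

```python
from lightsim2grid import LightSimBackend

# hypothetical configuration: the key names come from the signature above,
# the values are assumptions chosen purely for illustration
env_config = {
    "backend_cls": LightSimBackend,
    "obs_attr_to_keep": ["rho", "gen_p", "load_p"],
    "act_type": "discrete",
    "act_attr_to_keep": ["set_bus"],
}

# Grid2opEnvWrapper is the class defined in the cell above
wrapped_env = Grid2opEnvWrapper(env_config)
obs, info = wrapped_env.reset()  # gymnasium-style reset
```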
207 | 247 | "# Construct a generic config object, specifying values within different\n", |
208 | 248 | "# sub-categories, e.g. \"training\".\n", |
209 | 249 | "config = (PPOConfig().training(gamma=0.9, lr=0.01)\n", |
210 | | - " .environment(env=Grid2opEnv, env_config={})\n", |
| 250 | + " .environment(env=Grid2opEnvWrapper, env_config={})\n", |
211 | 251 | " .resources(num_gpus=0)\n", |
212 | 252 | " .env_runners(num_env_runners=0)\n", |
213 | 253 | " .framework(\"tf2\")\n", |
|
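Assuming the `config` object above is completed as shown in this cell, it is typically turned into a trainable rllib algorithm roughly as follows (a sketch, with an arbitrary number of training iterations):

```python
# build the PPO algorithm from the config and run a few training iterations
algo = config.build()
for _ in range(2):
    results = algo.train()  # one rllib training iteration (rollouts + gradient updates)
algo.stop()  # release the resources (workers, environments, ...)
```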
239 | 279 | "cell_type": "markdown", |
240 | 280 | "metadata": {}, |
241 | 281 | "source": [ |
242 | | - "## 3 Train a PPO agent using 2 \"runners\" to make the rollouts" |
| 282 | + "## 3 Train a PPO agent using 2 \"runners\" to make the rollouts\n", |
| 283 | + "\n", |
| 284 | + "In this second example, we explain briefly how to train the model using 2 \"processes\". This is, the agent will interact with 2 agents at the same time during the \"rollout\" phases.\n", |
| 285 | + "\n", |
| 286 | + "But everything related to the training of the agent is still done on the main process (and in this case not using a GPU but only a CPU)." |
243 | 287 | ] |
244 | 288 | }, |
245 | 289 | { |
|
250 | 294 | "source": [ |
251 | 295 | "# see https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html\n", |
252 | 296 | "\n", |
253 | | - "# use multiple use multiple runners\n", |
| 297 | + "# use multiple runners\n", |
254 | 298 | "config2 = (PPOConfig().training(gamma=0.9, lr=0.01)\n", |
255 | | - "           .environment(env=Grid2opEnv, env_config={})\n", |
| 299 | + "           .environment(env=Grid2opEnvWrapper, env_config={})\n", |
256 | 300 | " .resources(num_gpus=0)\n", |
|
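Once `config2` has been completed as in the rest of this notebook, the learned policy can be evaluated on a single wrapped environment more or less like this (a sketch: the name `algo2` is an assumption, and the loop assumes the wrapper follows the gymnasium 5-tuple `step` API):

```python
# build the algorithm from config2 (in practice you would call algo2.train() a number of times first)
algo2 = config2.build()

# roll out the (trained) policy on one episode, without exploration noise
eval_env = Grid2opEnvWrapper({})
obs, info = eval_env.reset()
done, total_reward = False, 0.0
while not done:
    action = algo2.compute_single_action(obs, explore=False)
    obs, reward, terminated, truncated, info = eval_env.step(action)
    done = terminated or truncated
    total_reward += reward
print(f"cumulative reward on this episode: {total_reward:.2f}")
```

Note that with the newer rllib "API stack" you may have to query the trained RLModule directly instead of using `compute_single_action`.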
282 | 326 | "cell_type": "markdown", |
283 | 327 | "metadata": {}, |
284 | 328 | "source": [ |
285 | | - "## 4 Use non default parameters to make the l2rpn environment\n", |
| 329 | + "## 4 Use non default parameters to make the grid2op environment\n", |
286 | 330 | "\n", |
287 | | - "In this first example, we will train a policy using the \"box\" action space." |
| 331 | + "In this third example, we will train a policy using the \"box\" action space, and on another environment (`l2rpn_idf_2023` instead of `l2rpn_case14_sandbox`)" |
288 | 332 | ] |
289 | 333 | }, |
290 | 334 | { |
|
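A possible configuration for this third example could look like the sketch below. Only the keys visible in the `Grid2opEnvWrapper` signature are known to exist; the "env_name" key and the attribute lists are assumptions made for illustration:

```python
from ray.rllib.algorithms.ppo import PPOConfig
from lightsim2grid import LightSimBackend

# hypothetical env_config: "env_name" is an assumed key, the attribute lists are
# examples (continuous redispatching / curtailment actions, a few observation attributes)
env_config3 = {
    "env_name": "l2rpn_idf_2023",
    "backend_cls": LightSimBackend,
    "act_type": "box",
    "act_attr_to_keep": ["redispatch", "curtail"],
    "obs_attr_to_keep": ["rho", "gen_p", "load_p"],
}

# same PPO configuration pattern as before, pointing at the new env_config
config3 = (PPOConfig().training(gamma=0.9, lr=0.01)
           .environment(env=Grid2opEnvWrapper, env_config=env_config3)
           .resources(num_gpus=0)
           .env_runners(num_env_runners=0)
           .framework("tf2")
          )
```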
441 | 485 | "name": "python", |
442 | 486 | "nbconvert_exporter": "python", |
443 | 487 | "pygments_lexer": "ipython3", |
444 | | - "version": "3.10.13" |
| 488 | + "version": "3.8.10" |
445 | 489 | } |
446 | 490 | }, |
447 | 491 | "nbformat": 4, |
|