|
15 | 15 | "\n", |
16 | 16 | "This notebook is more an \"example of what works\" rather than a deep dive tutorial.\n", |
17 | 17 | "\n", |
18 | | - "See https://docs.ray.io/en/latest/rllib/rllib-env.html#configuring-environments for a more detailed information.\n", |
19 | | - "\n", |
20 | | - "See also https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html for other details\n", |
21 | | - "\n", |
22 | | - "This notebook is tested with grid2op 1.10 and ray 2.23 on an ubuntu 20.04 machine.\n", |
23 | | - "\n", |
| 18 | + "See stable-baselines3.readthedocs.io/ for a more detailed information.\n", |
| 19 | + "\n", |
| 20 | + "This notebook is tested with grid2op 1.10.2 and stable baselines3 version 2.3.2 on an ubuntu 20.04 machine.\n", |
| 21 | + "\n", |
| 22 | + "\n", |
| 23 | + "## 0 Some tips to get started\n", |
| 24 | + "\n", |
| 25 | + "<font color='red'> It is unlikely that \"simply\" using a RL algorithm on a grid2op environment will lead to good results for the vast majority of environments.</font>\n", |
| 26 | + "\n", |
| 27 | + "To make RL algorithms work with more or less sucess you might want to:\n", |
| 28 | + "\n", |
| 29 | + "    1) adjust the observation space: in particular selecting the right information for your agent. Too much information\n", |
| 30 | + " and the size of the observation space will blow up and your agent will not learn anything. Not enough\n", |
| 31 | + " information and your agent will not be able to capture anything.\n", |
| 32 | + " \n", |
| 33 | + "    2) customize the action space: dealing with both discrete and continuous values is often a challenge. So maybe you\n", |
| 34 | + "    want to focus on only one type of action. And in all cases, try to reduce the number of actions your agent\n", |
| 35 | + "    can perform. Indeed, for \"larger\" grids (118 substations, as a reference the french grid counts more than 6,000\n", |
| 36 | + "    such substations...) and even when limiting each substation to 2 busbars (as a reference, some substations have\n", |
| 37 | + "    more than 12 such \"busbars\"), your agent will have to choose between more than 60,000 different discrete actions\n", |
| 38 | + "    at each step. This is way too large for current RL algorithms as far as we know (and the proposed environments\n", |
| 39 | + "    are small in comparison to real ones).\n", |
| 40 | + " \n", |
| 41 | + "    3) customize the reward: the default reward might not work great for you. Ultimately, what TSO's or ISO's want is\n", |
| 42 | + "    to operate the grid safely, as long as possible and with a cost as low as possible. It is of course really hard to\n", |
| 43 | + "    capture all of this in one single reward signal. Customizing the reward is also really important because the \"do\n", |
| 44 | + "    nothing\" policy often leads to really good results (much better than random actions), which makes exploration difficult\n", |
| 45 | + "    (your agent has little incentive to try different actions...). So you kind of want to incentivize your agent to perform some actions at some point (a minimal sketch of these customizations is given right after this section).\n", |
| 46 | + " \n", |
| 47 | + "    4) use a fast simulator: even if you target an industrial application with industry grade simulators, we would still\n", |
| 48 | + "    advise you to use (at least at the early stages of training) a fast simulator for the vast majority of the training\n", |
| 49 | + "    process and then maybe fine tune on a better one.\n", |
| 50 | + " \n", |
| 51 | + "    5) combine RL with some heuristics: it's super easy to implement things like \"if there is no issue, then do\n", |
| 52 | + "    nothing\" as a rule, whereas learning it can be quite time consuming for an RL agent. Don't hesitate to check out\n", |
| 53 | + "    the \"l2rpn-baselines\" repository for heuristics that already \"kind of work\".\n", |
| 54 | + " \n", |
| 55 | + "And finally don't hesitate to check solution proposed by winners of past l2rpn competitions in l2rpn-baselines.\n", |
| 56 | + "\n", |
| 57 | + "You can also ask question on our discord or on our github." |
| 58 | + ] |
| 59 | + }, |
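To make tips 1) to 4) a bit more concrete, here is a minimal sketch (not taken from this notebook, and only one possible choice of attributes and reward class) of how the observation space, the action space, the reward and the backend can be customized with grid2op's `gym_compat` module:

```python
import grid2op
from grid2op.gym_compat import GymEnv, BoxGymObsSpace, DiscreteActSpace
from grid2op.Reward import LinesCapacityReward  # example reward (tip 3)
from lightsim2grid import LightSimBackend       # fast simulator (tip 4)

# create the grid2op environment with a fast backend and a custom reward
g2op_env = grid2op.make("l2rpn_case14_sandbox",
                        backend=LightSimBackend(),
                        reward_class=LinesCapacityReward)

# wrap it as a gymnasium environment
gym_env = GymEnv(g2op_env)

# tip 1: keep only a handful of observation attributes
gym_env.observation_space = BoxGymObsSpace(g2op_env.observation_space,
                                           attr_to_keep=["rho", "gen_p", "load_p"])

# tip 2: keep only discrete topology actions
gym_env.action_space = DiscreteActSpace(g2op_env.action_space,
                                        attr_to_keep=["set_bus"])
```

The `Grid2opEnvWrapper` defined later in this notebook performs this kind of customization internally, driven by its `env_config` dictionary.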
| 60 | + { |
| 61 | + "cell_type": "markdown", |
| 62 | + "metadata": {}, |
| 63 | + "source": [ |
24 | 64 | "\n", |
25 | 65 | "## 1 Create the \"Grid2opEnv\" class\n", |
26 | 66 | "\n", |
27 | | - "In the next cell, we define a custom environment (that will internally use the `GymEnv` grid2op class) that is needed for ray / rllib.\n", |
| 67 | + "In the next cell, we define a custom environment (that will internally use the `GymEnv` grid2op class). It is not strictly needed\n", |
28 | 68 | "\n", |
29 | 69 | "Indeed, in order to work with ray / rllib you need to define a custom wrapper on top of the GymEnv wrapper. You then have:\n", |
30 | 70 | "\n", |
31 | 71 | "- self._g2op_env which is the default grid2op environment, receiving grid2op Action and producing grid2op Observation.\n", |
32 | 72 | "- self._gym_env which is the grid2op defined `gymnasium Environment` that cannot be directly used with ray / rllib\n", |
33 | | - "- `Grid2opEnv` which is a the wrapper on top of `self._gym_env` to make it usable with ray / rllib.\n", |
| 73 | + "- `Grid2opEnvWrapper` which is a the wrapper on top of `self._gym_env` to make it usable with ray / rllib.\n", |
34 | 74 | "\n", |
35 | | - "Ray / rllib expects the gymnasium environment to inherit from `gymnasium.Env` and to be initialized with a given configuration. This is why you need to create the `Grid2opEnv` wrapper on top of `GymEnv`.\n", |
| 75 | + "Ray / rllib expects the gymnasium environment to inherit from `gymnasium.Env` and to be initialized with a given configuration. This is why you need to create the `Grid2opEnvWrapper` wrapper on top of `GymEnv`.\n", |
36 | 76 | "\n", |
37 | | - "In the initialization of `Grid2opEnv`, the `env_config` variable is a dictionary that can take as key-word arguments:\n", |
| 77 | + "In the initialization of `Grid2opEnvWrapper`, the `env_config` variable is a dictionary that can take as key-word arguments:\n", |
38 | 78 | "\n", |
39 | 79 | "- `backend_cls` : what is the class of the backend. If not provided, it will use `LightSimBackend` from the `lightsim2grid` package\n", |
40 | 80 | "- `backend_options`: what options will be used to create the backend for your environment. Your backend will be created by calling\n", |
|
74 | 114 | "from lightsim2grid import LightSimBackend\n", |
75 | 115 | "\n", |
76 | 116 | "\n", |
77 | | - "class Grid2opEnv(Env):\n", |
| 117 | + "class Grid2opEnvWrapper(Env):\n", |
78 | 118 | " def __init__(self,\n", |
79 | 119 | " env_config: Dict[Literal[\"backend_cls\",\n", |
80 | 120 | " \"backend_options\",\n", |
|
83 | 123 | " \"obs_attr_to_keep\",\n", |
84 | 124 | " \"act_type\",\n", |
85 | 125 | " \"act_attr_to_keep\"],\n", |
86 | | - " Any]):\n", |
| 126 | + "                 Any] = None):\n", |
87 | 127 | " super().__init__()\n", |
88 | 128 | " if env_config is None:\n", |
89 | 129 | " env_config = {}\n", |
|
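For illustration only (the class body is truncated in this view, so the values shown for each key are assumptions rather than the notebook's own defaults), a wrapped environment could then be created with a custom configuration along these lines:

```python
from lightsim2grid import LightSimBackend

# hypothetical configuration: the key names come from the signature above,
# the values are assumptions chosen purely for illustration
env_config = {
    "backend_cls": LightSimBackend,
    "obs_attr_to_keep": ["rho", "gen_p", "load_p"],
    "act_type": "discrete",
    "act_attr_to_keep": ["set_bus"],
}

# Grid2opEnvWrapper is the class defined in the cell above
wrapped_env = Grid2opEnvWrapper(env_config)
obs, info = wrapped_env.reset()  # gymnasium-style reset
```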
207 | 247 | "# Construct a generic config object, specifying values within different\n", |
208 | 248 | "# sub-categories, e.g. \"training\".\n", |
209 | 249 | "config = (PPOConfig().training(gamma=0.9, lr=0.01)\n", |
210 | | - " .environment(env=Grid2opEnv, env_config={})\n", |
| 250 | + " .environment(env=Grid2opEnvWrapper, env_config={})\n", |
211 | 251 | " .resources(num_gpus=0)\n", |
212 | 252 | " .env_runners(num_env_runners=0)\n", |
213 | 253 | " .framework(\"tf2\")\n", |
|
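Assuming the `config` object above is completed as shown in this cell, it is typically turned into a trainable rllib algorithm roughly as follows (a sketch, with an arbitrary number of training iterations):

```python
# build the PPO algorithm from the config and run a few training iterations
algo = config.build()
for _ in range(2):
    results = algo.train()  # one rllib training iteration (rollouts + gradient updates)
algo.stop()  # release the resources (workers, environments, ...)
```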
239 | 279 | "cell_type": "markdown", |
240 | 280 | "metadata": {}, |
241 | 281 | "source": [ |
242 | | - "## 3 Train a PPO agent using 2 \"runners\" to make the rollouts" |
| 282 | + "## 3 Train a PPO agent using 2 \"runners\" to make the rollouts\n", |
| 283 | + "\n", |
| 284 | + "In this second example, we explain briefly how to train the model using 2 \"processes\". This is, the agent will interact with 2 agents at the same time during the \"rollout\" phases.\n", |
| 285 | + "\n", |
| 286 | + "But everything related to the training of the agent is still done on the main process (and in this case not using a GPU but only a CPU)." |
243 | 287 | ] |
244 | 288 | }, |
245 | 289 | { |
|
250 | 294 | "source": [ |
251 | 295 | "# see https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.html\n", |
252 | 296 | "\n", |
253 | | - "# use multiple use multiple runners\n", |
| 297 | + "# use multiple runners\n", |
254 | 298 | "config2 = (PPOConfig().training(gamma=0.9, lr=0.01)\n", |
255 | | - "           .environment(env=Grid2opEnv, env_config={})\n", |
| 299 | + "           .environment(env=Grid2opEnvWrapper, env_config={})\n", |
256 | 300 | " .resources(num_gpus=0)\n", |
|
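Once `config2` has been completed as in the rest of this notebook, the learned policy can be evaluated on a single wrapped environment more or less like this (a sketch: the name `algo2` is an assumption, and the loop assumes the wrapper follows the gymnasium 5-tuple `step` API):

```python
# build the algorithm from config2 (in practice you would call algo2.train() a number of times first)
algo2 = config2.build()

# roll out the (trained) policy on one episode, without exploration noise
eval_env = Grid2opEnvWrapper({})
obs, info = eval_env.reset()
done, total_reward = False, 0.0
while not done:
    action = algo2.compute_single_action(obs, explore=False)
    obs, reward, terminated, truncated, info = eval_env.step(action)
    done = terminated or truncated
    total_reward += reward
print(f"cumulative reward on this episode: {total_reward:.2f}")
```

Note that with the newer rllib "API stack" you may have to query the trained RLModule directly instead of using `compute_single_action`.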
282 | 326 | "cell_type": "markdown", |
283 | 327 | "metadata": {}, |
284 | 328 | "source": [ |
285 | | - "## 4 Use non default parameters to make the l2rpn environment\n", |
| 329 | + "## 4 Use non default parameters to make the grid2op environment\n", |
286 | 330 | "\n", |
287 | | - "In this first example, we will train a policy using the \"box\" action space." |
| 331 | + "In this third example, we will train a policy using the \"box\" action space, and on another environment (`l2rpn_idf_2023` instead of `l2rpn_case14_sandbox`)" |
288 | 332 | ] |
289 | 333 | }, |
290 | 334 | { |
|
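A possible configuration for this third example could look like the sketch below. Only the keys visible in the `Grid2opEnvWrapper` signature are known to exist; the "env_name" key and the attribute lists are assumptions made for illustration:

```python
from ray.rllib.algorithms.ppo import PPOConfig
from lightsim2grid import LightSimBackend

# hypothetical env_config: "env_name" is an assumed key, the attribute lists are
# examples (continuous redispatching / curtailment actions, a few observation attributes)
env_config3 = {
    "env_name": "l2rpn_idf_2023",
    "backend_cls": LightSimBackend,
    "act_type": "box",
    "act_attr_to_keep": ["redispatch", "curtail"],
    "obs_attr_to_keep": ["rho", "gen_p", "load_p"],
}

# same PPO configuration pattern as before, pointing at the new env_config
config3 = (PPOConfig().training(gamma=0.9, lr=0.01)
           .environment(env=Grid2opEnvWrapper, env_config=env_config3)
           .resources(num_gpus=0)
           .env_runners(num_env_runners=0)
           .framework("tf2")
          )
```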
441 | 485 | "name": "python", |
442 | 486 | "nbconvert_exporter": "python", |
443 | 487 | "pygments_lexer": "ipython3", |
444 | | - "version": "3.10.13" |
| 488 | + "version": "3.8.10" |
445 | 489 | } |
446 | 490 | }, |
447 | 491 | "nbformat": 4, |
|