Commit 547acbe

maryamhonari and taozhuo authored and committed

Trainer qa fix (#78)

* grammer fixes
* fix nit comments from QA
* add info about on/off policy
* add more context to the code block
* more context and minor fix
* pip3

Co-authored-by: zhuo <[email protected]>
1 parent df96d5c commit 547acbe

File tree

4 files changed (+48 −52 lines)

docs/Learning-Environment-Examples.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 # Example Learning Environments
 
-<img src="../images/example-envs.png" align="middle" width="3000"/>
+<img src="images/example-envs.png" align="middle" width="3000"/>
 
 The Unity ML-Agents Toolkit includes an expanding set of example environments
 that highlight the various features of the toolkit. These environments can also
```

docs/Python-Custom-Trainer-Plugin.md

Lines changed: 7 additions & 8 deletions
```diff
@@ -6,8 +6,9 @@ in `Ml-agents` Package. This will allow rerouting `mlagents-learn` CLI to custom
 with hyper-parameters specific to your new trainers. We will expose a high-level extensible trainer (both on-policy,
 and off-policy trainers) optimizer and hyperparameter classes with documentation for the use of this plugin. For more
 infromation on how python plugin system works see [Plugin interfaces](Training-Plugins.md).
-
 ## Overview
+Model-free RL algorithms generally fall into two broad categories: on-policy and off-policy. On-policy algorithms perform updates based on data gathered from the current policy. Off-policy algorithms learn a Q function from a buffer of previous data, then use this Q function to make decisions. Off-policy algorithms have three key benefits in the context of ML-Agents: They tend to use fewer samples than on-policy as they can pull and re-use data from the buffer many times. They allow player demonstrations to be inserted in-line with RL data into the buffer, enabling new ways of doing imitation learning by streaming player data.
+
 To add new custom trainers to ML-agents, you would need to create a new python package.
 To give you an idea of how to structure your package, we have created a [mlagents_trainer_plugin](../ml-agents-trainer-plugin) package ourselves as an
 example, with implementation of `A2c` and `DQN` algorithms. You would need a `setup.py` file to list extra requirements and
```
````diff
@@ -31,22 +32,20 @@ configuration.
 └── setup.py
 ```
 ## Installation and Execution
-To install your new package, you need to have `ml-agents-env` and `ml-agents` installed following by the installation of
-plugin package.
+If you haven't already, follow the [installation instructions](Installation.md). Once you have the `ml-agents-env` and `ml-agents` packages you can install the plugin package. From the repository's root directory install `ml-agents-trainer-plugin` (or replace with the name of your plugin folder).
 
-```shell
-> pip3 install -e ./ml-agents-envs && pip3 install -e ./ml-agents
-> pip install -e <./ml-agents-trainer-plugin>
+```sh
+pip3 install -e <./ml-agents-trainer-plugin>
 ```
 
 Following the previous installations your package is added as an entrypoint and you can use a config file with new
 trainers:
-```shell
+```sh
 mlagents-learn ml-agents-trainer-plugin/mlagents_trainer_plugin/a2c/a2c_3DBall.yaml --run-id <run-id-name>
 --env <env-executable>
 ```
 
 ## Tutorial
-Here’s a step-by-step [tutorial](.) on how to write a setup file and extend ml-agents trainers, optimizers, and
+Here’s a step-by-step [tutorial](Tutorial-Custom-Trainer-Plugin.md) on how to write a setup file and extend ml-agents trainers, optimizers, and
 hyperparameter settings.To extend ML-agents classes see references on
 [trainers](Python-On-Off-Policy-Trainer-Documentation.md) and [Optimizer](Python-Optimizer-Documentation.md).
````
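The on-policy/off-policy distinction drawn in the Overview paragraph above can be illustrated with a toy replay buffer in plain Python. This is a hypothetical sketch (the `ReplayBuffer` class and its fields are illustrative, not ML-Agents code); its point is only that an off-policy learner can re-sample old experiences, including demonstration data, many times:

```python
import random
from collections import deque

class ReplayBuffer:
    """Toy replay buffer: off-policy learners re-sample stored data many times."""
    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)

    def add(self, experience):
        # Experiences may come from the current policy OR from player demonstrations.
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Sampling with replacement lets one experience feed many update steps.
        return random.choices(self.buffer, k=batch_size)

buffer = ReplayBuffer()
for step in range(100):
    buffer.add({"obs": step, "action": 0, "reward": 1.0, "done": False})
batch = buffer.sample(32)
print(len(batch))  # 32
```

An on-policy trainer, by contrast, would discard the buffer contents after each policy update, which is why it tends to need more samples.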

docs/Readme.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -23,7 +23,7 @@ rich environments and then made accessible to the wider research and game
 developer communities.
 
 ## Features
-- 18+ [example Unity environments](Learning-Environment-Examples.md)
+- 17+ [example Unity environments](Learning-Environment-Examples.md)
 - Support for multiple environment configurations and training scenarios
 - Flexible Unity SDK that can be integrated into your game or custom Unity scene
 - Support for training single-agent, multi-agent cooperative, and multi-agent
```

docs/Tutorial-Custom-Trainer-Plugin.md

Lines changed: 39 additions & 42 deletions
````diff
@@ -1,18 +1,15 @@
-### Step 1: Write your own custom trainer class
-Before you start writing your code, make sure to create a python environment:
+### Step 1: Write your custom trainer class
+Before you start writing your code, make sure to use your favorite environment management tool(e.g. `venv` or `conda`) to create and activate a Python virtual environment. The following command uses `conda`, but other tools work similarly:
 ```shell
-conda create -n trainer-env python=3.8
+conda create -n trainer-env python=3.8.13
+conda activate trainer-env
 ```
 
 Users of the plug-in system are responsible for implementing the trainer class subject to the API standard. Let us follow an example by implementing a custom trainer named "YourCustomTrainer". You can either extend `OnPolicyTrainer` or `OffPolicyTrainer` classes depending on the training strategies you choose.
 
-Model-free RL algorithms generally fall into two broad categories: on-policy and off-policy. On-policy algorithms rely on performing updates based on data gathered from the current policy. Off-policy algorithms learn a Q function from a buffer of previous data, then use this Q function to make decisions. Off-policy algorithms have three key benefits in the context of ML-Agents:
-They tend to use fewer samples than on-policy as they can pull and re-use data from the buffer many times.
-They allow player demonstrations to be inserted in-line with RL data into the buffer, enabling new ways of doing imitation learning by streaming player data.
-They are conducive to distributed training, where the policy running on other machines may not be synchronized with the current policy.
-However, until recently, off-policy algorithms tended to be more brittle, had difficulty with exploration, and were usually not as useful for continuous control problems. Soft Actor-Critic (Haarnoja et. al, 2018) is an off-policy algorithm that combines the sample-efficiency of Q-learning with the stochasticity of a policy-gradient method such as PPO.
+Please refer to the internal [PPO implementation](../ml-agents/mlagents/trainers/ppo/trainer.py) for a complete code example. We will not provide a workable code in the document. The purpose of the tutorial is to introduce you to the core components and interfaces of our plugin framework. We use code snippets and patterns to demonstrate the control and data flow.
 
-Your custom trainers are Responsible for collecting experiences and training the models. Your custom trainer class acts like a co-ordinator to the policy and optimizer. To start implement methods in the class, create a policy and an optimizer class objects:
+Your custom trainers are responsible for collecting experiences and training the models. Your custom trainer class acts like a co-ordinator to the policy and optimizer. To start implementing methods in the class, create a policy class objects from method `create_policy`:
 
 
 ```python
````
````diff
@@ -44,9 +41,9 @@ def create_policy(
 
 ```
 
-Depending on whether you use shared or separate network architecuture for your policy, we provide `SimpleActor` and `SharedActorCritic` from `mlagents.trainers.torch_entities.networks` that you can choose from. In our example above, we use a `SimpleActor`
+Depending on whether you use shared or separate network architecture for your policy, we provide `SimpleActor` and `SharedActorCritic` from `mlagents.trainers.torch_entities.networks` that you can choose from. In our example above, we use a `SimpleActor`.
 
-Next, create an optimizer class object from `create_optimizer` method:
+Next, create an optimizer class object from `create_optimizer` method and connect it to the policy object you created above:
 
 
 ```python
````
````diff
@@ -57,9 +54,10 @@ def create_optimizer(self) -> TorchOptimizer:
 
 ```
 
-There are a couple abstract methods(`_process_trajectory` and `_update_policy`) inherited from `RLTrainer` you need to implement in your custom trainer class. `_process_trajectory` takes a trajectory and processes it, puts it into the update buffer. Processing involves calculating value and advantage targets for the model updating step. Given input `trajectory: Trajectory`, users are responsible for processing the data in the trajectory and append `agent_buffer_trajectory` to the back of update buffer by calling `self._append_to_update_buffer(agent_buffer_trajectory)`, whose output will be used in updating the model in `optimizer` class.
+There are a couple of abstract methods(`_process_trajectory` and `_update_policy`) inherited from `RLTrainer` that you need to implement in your custom trainer class. `_process_trajectory` takes a trajectory and processes it, putting it into the update buffer. Processing involves calculating value and advantage targets for the model updating step. Given input `trajectory: Trajectory`, users are responsible for processing the data in the trajectory and append `agent_buffer_trajectory` to the back of the update buffer by calling `self._append_to_update_buffer(agent_buffer_trajectory)`, whose output will be used in updating the model in `optimizer` class.
+
+A typical `_process_trajectory` function(incomplete) will convert a trajectory object to an agent buffer then get all value estimates from the trajectory by calling `self.optimizer.get_trajectory_value_estimates`. From the returned dictionary of value estimates we extract reward signals keyed by their names:
 
-A typical `_process_trajectory` function(incomplete) - would look like the following:
 ```python
 def _process_trajectory(self, trajectory: Trajectory) -> None:
     super()._process_trajectory(trajectory)
````
````diff
@@ -72,7 +70,7 @@ def _process_trajectory(self, trajectory: Trajectory) -> None:
         value_estimates,
         value_next,
         value_memories,
-    ) = self.optimizer.get_trajectory_value_estimates(
+    ) = self.optimizer.get_trajectory_value_estimates(
         agent_buffer_trajectory,
         trajectory.next_obs,
         trajectory.done_reached and not trajectory.interrupted,
````
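The value estimates fetched by `get_trajectory_value_estimates` feed into the value and advantage targets mentioned above, which are built from discounted returns. A toy, self-contained version of that computation (plain Python, not the ML-Agents helper; function name is illustrative):

```python
def discounted_returns(rewards, gamma=0.99, bootstrap=0.0):
    """Compute the discounted return target for each step of a trajectory.

    `bootstrap` stands in for the value estimate of the state after the last
    step (0.0 when the episode terminated).
    """
    returns = []
    running = bootstrap
    for r in reversed(rewards):
        # Work backwards: G_t = r_t + gamma * G_{t+1}
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

The advantage target is then typically the difference between such a return and the critic's value estimate for that step.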
````diff
@@ -105,15 +103,14 @@ def _process_trajectory(self, trajectory: Trajectory) -> None:
 
 ```
 
-A trajectory will be a list of dictionaries of string to anything. When calling forward on a policy, the argument will include an “experience” dict of string to anything from the last step. The forward method will generate action and the next “experience” dictionary. Examples of fields in the “experience” dictionary include observation, action, reward, done status, group_reward, LSTM memory state, etc...
+A trajectory will be a list of dictionaries of strings mapped to `Anything`. When calling `forward` on a policy, the argument will include an “experience” dictionary from the last step. The `forward` method will generate an action and the next “experience” dictionary. Examples of fields in the “experience” dictionary include observation, action, reward, done status, group_reward, LSTM memory state, etc.
 
 
 
 ### Step 2: implement your custom optimizer for the trainer.
-We will show you an example we implemented - `class TorchPPOOptimizer(TorchOptimizer)`, Which Takes a Policy and a Dict of trainer parameters and creates an Optimizer around the policy. Your optimizer should include a value estimator and a loss function in the update method
+We will show you an example we implemented - `class TorchPPOOptimizer(TorchOptimizer)`, which takes a Policy and a Dict of trainer parameters and creates an Optimizer that connects to the policy. Your optimizer should include a value estimator and a loss function in the `update` method.
 
-Before writing your optimizer class, first define setting class `class PPOSettings(OnPolicyHyperparamSettings):
-` for your custom optimizer:
+Before writing your optimizer class, first define setting class `class PPOSettings(OnPolicyHyperparamSettings)` for your custom optimizer:
 
 
 
````
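The “experience” dictionary flow described in the hunk above can be sketched in plain Python. This is a toy stand-in (field names and the `forward` function are illustrative assumptions, not the exact ML-Agents schema):

```python
import random

def forward(experience):
    """Toy policy forward pass: consume the last step's "experience" dict and
    produce an action plus the next "experience" dictionary."""
    action = random.choice([0, 1])
    return action, {
        "observation": experience["observation"],  # would be the fresh observation
        "action": action,
        "reward": 0.0,
        "done": False,
        "group_reward": 0.0,
        "memory": experience.get("memory"),  # e.g. LSTM state carried forward
    }

# A trajectory is then just the list of per-step experience dictionaries.
experience = {"observation": [0.0, 0.0], "action": None, "reward": 0.0,
              "done": False, "group_reward": 0.0, "memory": None}
trajectory = []
for _ in range(3):
    action, experience = forward(experience)
    trajectory.append(experience)
print(len(trajectory))  # 3
```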
````diff
@@ -130,15 +127,15 @@ class PPOSettings(OnPolicyHyperparamSettings):
 
 ```
 
-You should implement `update` function:
+You should implement `update` function following interface:
 
 
 ```python
 def update(self, batch: AgentBuffer, num_sequences: int) -> Dict[str, float]:
 
 ```
 
-Calculate losses and other metrics from an `AgentBuffer` generated from your trainer class, a typical pattern(incomplete) would like this:
+In which losses and other metrics are calculated from an `AgentBuffer` that is generated from your trainer class, depending on which model you choose to implement the loss functions will be different. In our case we calculate value loss from critic and trust region policy loss. A typical pattern(incomplete) of the calculations will look like the following:
 
 
 ```python
````
````diff
@@ -173,7 +170,7 @@ loss = (
 
 ```
 
-Update the model and return the a dictionary including calculated losses and updated decay learning rate:
+Finally update the model and return the a dictionary including calculated losses and updated decay learning rate:
 
 
 ```python
````
````diff
@@ -183,8 +180,6 @@ loss.backward()
 
 self.optimizer.step()
 update_stats = {
-    # NOTE: abs() is not technically correct, but matches the behavior in TensorFlow.
-    # TODO: After PyTorch is default, change to something more correct.
     "Losses/Policy Loss": torch.abs(policy_loss).item(),
     "Losses/Value Loss": value_loss.item(),
     "Policy/Learning Rate": decay_lr,
````
````diff
@@ -216,7 +211,7 @@ your_trainer_type: name your trainer type, used in configuration file
 your_package: your pip installable package containing custom trainer implementation
 ```
 
-Also define get_type_and_setting method in YourCustomTrainer class:
+Also define `get_type_and_setting` method in `YourCustomTrainer` class:
 
 
 ```python
````
````diff
@@ -250,7 +245,7 @@ pip3 install your_custom_package
 ```
 Or follow our internal implementations:
 ```shell
-pip install -e ./ml-agents-trainer-plugin
+pip3 install -e ./ml-agents-trainer-plugin
 ```
 
 Following the previous installations your package is added as an entrypoint and you can use a config file with new
````
````diff
@@ -261,43 +256,45 @@ mlagents-learn ml-agents-trainer-plugin/mlagents_trainer_plugin/a2c/a2c_3DBall.y
 ```
 
 ### Validate your implementations:
-Create a clean python environment with python 3.8+ before you start.
+Create a clean Python environment with Python 3.8+ and activate it before you start, if you haven't done so already:
 ```shell
-conda create -n trainer-env python=3.8
+conda create -n trainer-env python=3.8.13
+conda activate trainer-env
 ```
 
-Make sure you follow previous steps and install all required packages. We are testing internal implementations here, but ML-Agents users can run similar validations once they have their own implementation installed:
+Make sure you follow previous steps and install all required packages. We are testing internal implementations in this tutorial, but ML-Agents users can run similar validations once they have their own implementations installed:
 ```shell
 pip3 install -e ./ml-agents-envs && pip3 install -e ./ml-agents
-pip install -e ./ml-agents-trainer-plugin
+pip3 install -e ./ml-agents-trainer-plugin
 ```
-Once your package is added as an entrypoint and you can use a config file with new trainer. Check if trainer type is specified in the config file `a2c_3DBall.yaml`:
+Once your package is added as an `entrypoint`, you can add to the config file the new trainer type. Check if trainer type is specified in the config file `a2c_3DBall.yaml`:
 ```
 trainer_type: a2c
 ```
 
-Test if custom trainer package is install:
+Test if custom trainer package is installed by running:
 ```shell
 mlagents-learn ml-agents-trainer-plugin/mlagents_trainer_plugin/a2c/a2c_3DBall.yaml --run-id test-trainer
 ```
 
+You can also list all trainers installed in the registry. Type `python` in your shell to open a REPL session. Run the python code below, you should be able to see all trainer types currently installed:
+```python
+>>> import pkg_resources
+>>> for entry in pkg_resources.iter_entry_points('mlagents.trainer_type'):
+...     print(entry)
+...
+default = mlagents.plugins.trainer_type:get_default_trainer_types
+a2c = mlagents_trainer_plugin.a2c.a2c_trainer:get_type_and_setting
+dqn = mlagents_trainer_plugin.dqn.dqn_trainer:get_type_and_setting
+```
+
 If it is properly installed, you will see Unity logo and message indicating training will start:
 ```
 [INFO] Listening on port 5004. Start training by pressing the Play button in the Unity Editor.
 ```
 
-If you see the following error message, it could be due to train type is wrong or the trainer type specified is not installed:
+If you see the following error message, it could be due to trainer type is wrong or the trainer type specified is not installed:
 ```shell
 mlagents.trainers.exception.TrainerConfigError: Invalid trainer type a2c was found
 ```
 
-You can also check all trainers installed in the registry. Type `python` in your shell to open a REPL session. Run the python code below, you should be able to see all trainer types installed:
-```python
->>> import pkg_resources
->>> for entry in pkg_resources.iter_entry_points('mlagents.trainer_type'):
-...     print(entry)
-...
-default = mlagents.plugins.trainer_type:get_default_trainer_types
-a2c = mlagents_trainer_plugin.a2c.a2c_trainer:get_type_and_setting
-dqn = mlagents_trainer_plugin.dqn.dqn_trainer:get_type_and_setting
-```
````
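The `pkg_resources` REPL session quoted in the diff above can also be written with the standard-library `importlib.metadata` (Python 3.8+), since `pkg_resources` is deprecated in newer setuptools. A sketch, assuming the `mlagents.trainer_type` entry-point group from the snippet; the list is simply empty in an environment with no trainer plugins installed:

```python
import sys
from importlib import metadata

def trainer_entry_points(group="mlagents.trainer_type"):
    """List entry points registered under the given group.

    Returns an empty list when no plugins are installed in this environment.
    """
    if sys.version_info >= (3, 10):
        # Python 3.10+: entry_points() supports selection by group.
        eps = metadata.entry_points().select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group.
        eps = metadata.entry_points().get(group, [])
    return [f"{ep.name} = {ep.value}" for ep in eps]

for line in trainer_entry_points():
    print(line)
```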
