
Commit b2db1d9

Authored by Ervin T
Fix docs for reward signals (#2320)
1 parent 5d7dd57 commit b2db1d9

2 files changed: 14 additions, 13 deletions


docs/ML-Agents-Overview.md

Lines changed: 10 additions & 8 deletions
@@ -185,8 +185,8 @@ range of training and inference scenarios:
 - **Learning** - where decisions are made using an embedded
 [TensorFlow](Background-TensorFlow.md) model. The embedded TensorFlow model
 represents a learned policy and the Brain directly uses this model to
-determine the action for each Agent. You can train a **Learning Brain**
-by dragging it into the Academy's `Broadcast Hub` with the `Control`
+determine the action for each Agent. You can train a **Learning Brain**
+by dragging it into the Academy's `Broadcast Hub` with the `Control`
 checkbox checked.
 - **Player** - where decisions are made using real input from a keyboard or
 controller. Here, a human player is controlling the Agent and the observations
@@ -224,7 +224,7 @@ inference can proceed.

 As mentioned previously, the ML-Agents toolkit ships with several
 implementations of state-of-the-art algorithms for training intelligent agents.
-In this mode, the only Brain used is a **Learning Brain**. More
+In this mode, the only Brain used is a **Learning Brain**. More
 specifically, during training, all the medics in the
 scene send their observations to the Python API through the External
 Communicator (this is the behavior with an External Brain). The Python API
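The hunk above keeps the description of the training loop: Agents send their observations to the Python API through the External Communicator, and the Python API returns actions. As a rough, non-authoritative sketch of that loop, assuming the mlagents Python package of this release line exposes `UnityEnvironment` with `reset()`/`step()` returning per-Brain info objects; the binary name "3DBall" and the fixed action size are placeholders, not taken from this commit:

```python
# Minimal sketch of the observation -> action loop described above, under the
# assumptions stated in the lead-in. Requires a built Unity environment.
import numpy as np
from mlagents.envs import UnityEnvironment  # assumed import path for this release line

env = UnityEnvironment(file_name="3DBall")       # "3DBall" is a placeholder build
brain_name = env.external_brain_names[0]         # the Learning Brain being trained
info = env.reset(train_mode=True)[brain_name]    # observations for all linked Agents

for _ in range(100):
    # A real trainer would compute actions from its policy; random continuous
    # actions of a placeholder size stand in here.
    actions = np.random.uniform(-1, 1, size=(len(info.agents), 2))
    info = env.step(actions)[brain_name]         # send actions, receive next observations
env.close()
```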
@@ -244,7 +244,7 @@ time.
 To summarize: our built-in implementations are based on TensorFlow, thus, during
 training the Python API uses the observations it receives to learn a TensorFlow
 model. This model is then embedded within the Learning Brain during inference to
-generate the optimal actions for all Agents linked to that Brain.
+generate the optimal actions for all Agents linked to that Brain.

 The
 [Getting Started with the 3D Balance Ball Example](Getting-Started-with-Balance-Ball.md)
@@ -255,7 +255,7 @@ tutorial covers this training mode with the **3D Balance Ball** sample environment
 In the previous mode, the Learning Brain was used for training to generate
 a TensorFlow model that the Learning Brain can later use. However,
 any user of the ML-Agents toolkit can leverage their own algorithms for
-training. In this case, the Brain type would be set to Learning and be linked
+training. In this case, the Brain type would be set to Learning and be linked
 to the BroadcastHub (with checked `Control` checkbox)
 and the behaviors of all the Agents in the scene will be controlled within Python.
 You can even turn your environment into a [gym.](../gym-unity/README.md)
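Since this passage points to the gym wrapper in ../gym-unity, here is a minimal sketch of driving the environment with your own algorithm through that interface. The `UnityEnv` class name, its flags, and the binary name are assumptions about the wrapper's API rather than something this commit documents:

```python
# Sketch of controlling all Agents from Python via the gym-style wrapper,
# under the assumptions stated in the lead-in.
from gym_unity.envs import UnityEnv  # assumed module/class name for the wrapper

env = UnityEnv("3DBall", worker_id=0, use_visual=False)  # flag names are assumptions
obs = env.reset()
for _ in range(1000):
    action = env.action_space.sample()   # plug in your own training algorithm here
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```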
@@ -319,8 +319,10 @@ imitation learning algorithm will then use these pairs of observations and
 actions from the human player to learn a policy. [Video
 Link](https://youtu.be/kpb8ZkMBFYs).

-The [Training with Imitation Learning](Training-Imitation-Learning.md) tutorial
-covers this training mode with the **Banana Collector** sample environment.
+ML-Agents provides ways to both learn directly from demonstrations as well as
+use demonstrations to help speed up reward-based training. The
+[Training with Imitation Learning](Training-Imitation-Learning.md) tutorial
+covers these features in more depth.

 ## Flexible Training Scenarios

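The new paragraph distinguishes learning directly from demonstrations from using demonstrations to speed up reward-based training. A hedged sketch of what those two routes look like in the trainer configuration, written as Python dicts mirroring the YAML config; the keys follow the hyperparameter table changed in docs/Training-ML-Agents.md below, while the Brain name, demo path, and numeric values are placeholders:

```python
# Two demonstration-driven routes, as Python dicts mirroring the YAML trainer
# config. Keys follow the hyperparameter table in docs/Training-ML-Agents.md;
# the Brain name, demo path, and numbers are illustrative placeholders.

# 1) Learn directly from demonstrations: a behavioral-cloning trainer.
learn_directly = {
    "StudentBrain": {
        "trainer": "offline_bc",            # or "online_bc" with brain_to_imitate
        "demo_path": "demos/Expert.demo",   # recorded demonstration file (placeholder)
    }
}

# 2) Use demonstrations to speed up reward-based (PPO) training.
speed_up_ppo = {
    "StudentBrain": {
        "trainer": "ppo",
        "pretraining": {                    # bootstrap the policy from the demo
            "demo_path": "demos/Expert.demo",
            "strength": 0.5,                # illustrative value
        },
        # A GAIL reward signal can also be driven by the same demo; see the
        # reward_signals sketch after the Training-ML-Agents.md diff below.
    }
}
```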
@@ -408,7 +410,7 @@ training process.
 - **Broadcasting** - As discussed earlier, a Learning Brain sends the
 observations for all its Agents to the Python API when dragged into the
 Academy's `Broadcast Hub` with the `Control` checkbox checked. This is helpful
-for training and later inference. Broadcasting is a feature which can be
+for training and later inference. Broadcasting is a feature which can be
 enabled all types of Brains (Player, Learning, Heuristic) where the Agent
 observations and actions are also sent to the Python API (despite the fact
 that the Agent is **not** controlled by the Python API). This feature is
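This paragraph says that Brains of any type can forward their Agents' observations and actions to the Python API without being controlled by it. A speculative sketch of what reading that data could look like on the Python side, assuming (as the description implies, but not verified here) that broadcasting Brains appear in the per-Brain dict returned by `reset()`/`step()`; the import path and build name are placeholders:

```python
# Sketch of inspecting broadcast data from Python, under the assumptions
# stated in the lead-in. Python never sends actions to these Brains.
from mlagents.envs import UnityEnvironment  # assumed import path

env = UnityEnvironment(file_name="MyGame")       # placeholder build name
all_info = env.reset(train_mode=False)           # dict keyed by Brain name (assumed)
for brain_name, brain_info in all_info.items():
    # Per the paragraph above, Brains with broadcasting enabled also show up
    # here read-only: their observations can be logged or recorded.
    print(brain_name, brain_info.vector_observations.shape)
env.close()
```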

docs/Training-ML-Agents.md

Lines changed: 4 additions & 5 deletions
@@ -170,10 +170,7 @@ environments are included in the provided config file.
 | brain\_to\_imitate | For online imitation learning, the name of the GameObject containing the Brain component to imitate. | (online)BC |
 | demo_path | For offline imitation learning, the file path of the recorded demonstration file | (offline)BC |
 | buffer_size | The number of experiences to collect before updating the policy model. | PPO |
-| curiosity\_enc\_size | The size of the encoding to use in the forward and inverse models in the Curiosity module. | PPO |
-| curiosity_strength | Magnitude of intrinsic reward generated by Intrinsic Curiosity Module. | PPO |
 | epsilon | Influences how rapidly the policy can evolve during training. | PPO |
-| gamma | The reward discount rate for the Generalized Advantage Estimator (GAE). | PPO |
 | hidden_units | The number of units in the hidden layers of the neural network. | PPO, BC |
 | lambd | The regularization parameter. | PPO |
 | learning_rate | The initial learning rate for gradient descent. | PPO, BC |
@@ -182,13 +179,15 @@ environments are included in the provided config file.
 | normalize | Whether to automatically normalize observations. | PPO |
 | num_epoch | The number of passes to make through the experience buffer when performing gradient descent optimization. | PPO |
 | num_layers | The number of hidden layers in the neural network. | PPO, BC |
+| pretraining | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-pretraining-using-demonstrations). | PPO |
+| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Training-RewardSignals.md) for configuration options. | PPO |
 | sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, BC |
 | summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, BC |
 | time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, (online)BC |
-| trainer | The type of training to perform: "ppo" or "imitation". | PPO, BC |
-| use_curiosity | Train using an additional intrinsic reward signal generated from Intrinsic Curiosity Module. | PPO |
+| trainer | The type of training to perform: "ppo", "offline_bc" or "online_bc". | PPO, BC |
 | use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, BC |

+
 \*PPO = Proximal Policy Optimization, BC = Behavioral Cloning (Imitation)

 For specific advice on setting hyperparameters based on the type of training you
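The new `reward_signals` and `pretraining` rows defer to [Reward Signals](Training-RewardSignals.md) and Training-PPO.md for the actual options. As a rough illustration only, a Brain section using them might look like the following, written as a Python dict mirroring the YAML trainer config; the sub-keys and numeric values are assumptions drawn from those linked docs rather than from this commit, and the demo path is a placeholder:

```python
# Rough illustration of a Brain section using the new reward_signals and
# pretraining entries, as a Python dict mirroring the YAML trainer config.
# Sub-keys and values are assumptions; see the linked docs for the
# authoritative options.
brain_config = {
    "MyBrain": {
        "trainer": "ppo",
        "reward_signals": {
            "extrinsic": {"strength": 1.0, "gamma": 0.99},
            "curiosity": {"strength": 0.02, "gamma": 0.99, "encoding_size": 256},
            "gail": {"strength": 0.01, "demo_path": "demos/Expert.demo"},
        },
        "pretraining": {"demo_path": "demos/Expert.demo", "strength": 0.5, "steps": 10000},
    }
}
```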
