
Commit b2db1d9

Authored by Ervin T
Fix docs for reward signals (#2320)
1 parent 5d7dd57 commit b2db1d9

2 files changed: 14 additions, 13 deletions


docs/ML-Agents-Overview.md

Lines changed: 10 additions & 8 deletions
@@ -185,8 +185,8 @@ range of training and inference scenarios:
 - **Learning** - where decisions are made using an embedded
 [TensorFlow](Background-TensorFlow.md) model. The embedded TensorFlow model
 represents a learned policy and the Brain directly uses this model to
-determine the action for each Agent. You can train a **Learning Brain**
-by dragging it into the Academy's `Broadcast Hub` with the `Control`
+determine the action for each Agent. You can train a **Learning Brain**
+by dragging it into the Academy's `Broadcast Hub` with the `Control`
 checkbox checked.
 - **Player** - where decisions are made using real input from a keyboard or
 controller. Here, a human player is controlling the Agent and the observations
@@ -224,7 +224,7 @@ inference can proceed.

 As mentioned previously, the ML-Agents toolkit ships with several
 implementations of state-of-the-art algorithms for training intelligent agents.
-In this mode, the only Brain used is a **Learning Brain**. More
+In this mode, the only Brain used is a **Learning Brain**. More
 specifically, during training, all the medics in the
 scene send their observations to the Python API through the External
 Communicator (this is the behavior with an External Brain). The Python API
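The hunk above keeps the description of the training loop: Agents send their observations to the Python API through the External Communicator, and the Python API returns actions. As a rough, non-authoritative sketch of that loop, assuming the mlagents Python package of this release line exposes `UnityEnvironment` with `reset()`/`step()` returning per-Brain info objects; the binary name "3DBall" and the fixed action size are placeholders, not taken from this commit:

```python
# Minimal sketch of the observation -> action loop described above, under the
# assumptions stated in the lead-in. Requires a built Unity environment.
import numpy as np
from mlagents.envs import UnityEnvironment  # assumed import path for this release line

env = UnityEnvironment(file_name="3DBall")       # "3DBall" is a placeholder build
brain_name = env.external_brain_names[0]         # the Learning Brain being trained
info = env.reset(train_mode=True)[brain_name]    # observations for all linked Agents

for _ in range(100):
    # A real trainer would compute actions from its policy; random continuous
    # actions of a placeholder size stand in here.
    actions = np.random.uniform(-1, 1, size=(len(info.agents), 2))
    info = env.step(actions)[brain_name]         # send actions, receive next observations
env.close()
```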
@@ -244,7 +244,7 @@ time.
 To summarize: our built-in implementations are based on TensorFlow, thus, during
 training the Python API uses the observations it receives to learn a TensorFlow
 model. This model is then embedded within the Learning Brain during inference to
-generate the optimal actions for all Agents linked to that Brain.
+generate the optimal actions for all Agents linked to that Brain.

 The
 [Getting Started with the 3D Balance Ball Example](Getting-Started-with-Balance-Ball.md)
@@ -255,7 +255,7 @@ tutorial covers this training mode with the **3D Balance Ball** sample environment
 In the previous mode, the Learning Brain was used for training to generate
 a TensorFlow model that the Learning Brain can later use. However,
 any user of the ML-Agents toolkit can leverage their own algorithms for
-training. In this case, the Brain type would be set to Learning and be linked
+training. In this case, the Brain type would be set to Learning and be linked
 to the BroadcastHub (with checked `Control` checkbox)
 and the behaviors of all the Agents in the scene will be controlled within Python.
 You can even turn your environment into a [gym.](../gym-unity/README.md)
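Since this passage points to the gym wrapper in ../gym-unity, here is a minimal sketch of driving the environment with your own algorithm through that interface. The `UnityEnv` class name, its flags, and the binary name are assumptions about the wrapper's API rather than something this commit documents:

```python
# Sketch of controlling all Agents from Python via the gym-style wrapper,
# under the assumptions stated in the lead-in.
from gym_unity.envs import UnityEnv  # assumed module/class name for the wrapper

env = UnityEnv("3DBall", worker_id=0, use_visual=False)  # flag names are assumptions
obs = env.reset()
for _ in range(1000):
    action = env.action_space.sample()   # plug in your own training algorithm here
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```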
@@ -319,8 +319,10 @@ imitation learning algorithm will then use these pairs of observations and
 actions from the human player to learn a policy. [Video
 Link](https://youtu.be/kpb8ZkMBFYs).

-The [Training with Imitation Learning](Training-Imitation-Learning.md) tutorial
-covers this training mode with the **Banana Collector** sample environment.
+ML-Agents provides ways to both learn directly from demonstrations as well as
+use demonstrations to help speed up reward-based training. The
+[Training with Imitation Learning](Training-Imitation-Learning.md) tutorial
+covers these features in more depth.

 ## Flexible Training Scenarios

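The new paragraph distinguishes learning directly from demonstrations from using demonstrations to speed up reward-based training. A hedged sketch of what those two routes look like in the trainer configuration, written as Python dicts mirroring the YAML config; the keys follow the hyperparameter table changed in docs/Training-ML-Agents.md below, while the Brain name, demo path, and numeric values are placeholders:

```python
# Two demonstration-driven routes, as Python dicts mirroring the YAML trainer
# config. Keys follow the hyperparameter table in docs/Training-ML-Agents.md;
# the Brain name, demo path, and numbers are illustrative placeholders.

# 1) Learn directly from demonstrations: a behavioral-cloning trainer.
learn_directly = {
    "StudentBrain": {
        "trainer": "offline_bc",            # or "online_bc" with brain_to_imitate
        "demo_path": "demos/Expert.demo",   # recorded demonstration file (placeholder)
    }
}

# 2) Use demonstrations to speed up reward-based (PPO) training.
speed_up_ppo = {
    "StudentBrain": {
        "trainer": "ppo",
        "pretraining": {                    # bootstrap the policy from the demo
            "demo_path": "demos/Expert.demo",
            "strength": 0.5,                # illustrative value
        },
        # A GAIL reward signal can also be driven by the same demo; see the
        # reward_signals sketch after the Training-ML-Agents.md diff below.
    }
}
```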
@@ -408,7 +410,7 @@ training process.
 - **Broadcasting** - As discussed earlier, a Learning Brain sends the
 observations for all its Agents to the Python API when dragged into the
 Academy's `Broadcast Hub` with the `Control` checkbox checked. This is helpful
-for training and later inference. Broadcasting is a feature which can be
+for training and later inference. Broadcasting is a feature which can be
 enabled all types of Brains (Player, Learning, Heuristic) where the Agent
 observations and actions are also sent to the Python API (despite the fact
 that the Agent is **not** controlled by the Python API). This feature is
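This paragraph says that Brains of any type can forward their Agents' observations and actions to the Python API without being controlled by it. A speculative sketch of what reading that data could look like on the Python side, assuming (as the description implies, but not verified here) that broadcasting Brains appear in the per-Brain dict returned by `reset()`/`step()`; the import path and build name are placeholders:

```python
# Sketch of inspecting broadcast data from Python, under the assumptions
# stated in the lead-in. Python never sends actions to these Brains.
from mlagents.envs import UnityEnvironment  # assumed import path

env = UnityEnvironment(file_name="MyGame")       # placeholder build name
all_info = env.reset(train_mode=False)           # dict keyed by Brain name (assumed)
for brain_name, brain_info in all_info.items():
    # Per the paragraph above, Brains with broadcasting enabled also show up
    # here read-only: their observations can be logged or recorded.
    print(brain_name, brain_info.vector_observations.shape)
env.close()
```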

docs/Training-ML-Agents.md

Lines changed: 4 additions & 5 deletions
@@ -170,10 +170,7 @@ environments are included in the provided config file.
 | brain\_to\_imitate | For online imitation learning, the name of the GameObject containing the Brain component to imitate. | (online)BC |
 | demo_path | For offline imitation learning, the file path of the recorded demonstration file | (offline)BC |
 | buffer_size | The number of experiences to collect before updating the policy model. | PPO |
-| curiosity\_enc\_size | The size of the encoding to use in the forward and inverse models in the Curiosity module. | PPO |
-| curiosity_strength | Magnitude of intrinsic reward generated by Intrinsic Curiosity Module. | PPO |
 | epsilon | Influences how rapidly the policy can evolve during training. | PPO |
-| gamma | The reward discount rate for the Generalized Advantage Estimator (GAE). | PPO |
 | hidden_units | The number of units in the hidden layers of the neural network. | PPO, BC |
 | lambd | The regularization parameter. | PPO |
 | learning_rate | The initial learning rate for gradient descent. | PPO, BC |
@@ -182,13 +179,15 @@ environments are included in the provided config file.
 | normalize | Whether to automatically normalize observations. | PPO |
 | num_epoch | The number of passes to make through the experience buffer when performing gradient descent optimization. | PPO |
 | num_layers | The number of hidden layers in the neural network. | PPO, BC |
+| pretraining | Use demonstrations to bootstrap the policy neural network. See [Pretraining Using Demonstrations](Training-PPO.md#optional-pretraining-using-demonstrations). | PPO |
+| reward_signals | The reward signals used to train the policy. Enable Curiosity and GAIL here. See [Reward Signals](Training-RewardSignals.md) for configuration options. | PPO |
 | sequence_length | Defines how long the sequences of experiences must be while training. Only used for training with a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, BC |
 | summary_freq | How often, in steps, to save training statistics. This determines the number of data points shown by TensorBoard. | PPO, BC |
 | time_horizon | How many steps of experience to collect per-agent before adding it to the experience buffer. | PPO, (online)BC |
-| trainer | The type of training to perform: "ppo" or "imitation". | PPO, BC |
-| use_curiosity | Train using an additional intrinsic reward signal generated from Intrinsic Curiosity Module. | PPO |
+| trainer | The type of training to perform: "ppo", "offline_bc" or "online_bc". | PPO, BC |
 | use_recurrent | Train using a recurrent neural network. See [Using Recurrent Neural Networks](Feature-Memory.md). | PPO, BC |

+
 \*PPO = Proximal Policy Optimization, BC = Behavioral Cloning (Imitation)

 For specific advice on setting hyperparameters based on the type of training you
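The new `reward_signals` and `pretraining` rows defer to [Reward Signals](Training-RewardSignals.md) and Training-PPO.md for the actual options. As a rough illustration only, a Brain section using them might look like the following, written as a Python dict mirroring the YAML trainer config; the sub-keys and numeric values are assumptions drawn from those linked docs rather than from this commit, and the demo path is a placeholder:

```python
# Rough illustration of a Brain section using the new reward_signals and
# pretraining entries, as a Python dict mirroring the YAML trainer config.
# Sub-keys and values are assumptions; see the linked docs for the
# authoritative options.
brain_config = {
    "MyBrain": {
        "trainer": "ppo",
        "reward_signals": {
            "extrinsic": {"strength": 1.0, "gamma": 0.99},
            "curiosity": {"strength": 0.02, "gamma": 0.99, "encoding_size": 256},
            "gail": {"strength": 0.01, "demo_path": "demos/Expert.demo"},
        },
        "pretraining": {"demo_path": "demos/Expert.demo", "strength": 0.5, "steps": 10000},
    }
}
```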
