
Commit 32ccbb5

Fix gradient not averaged when parallel training. (#1104)
* Fix gradient not averaged when parallel training.
* Correct throughput metrics and explain CPU runtime in the parallel-training tutorial.
1 parent a5bdd14 commit 32ccbb5

File tree

2 files changed: +14 −8 lines

deepmd/train/trainer.py

Lines changed: 4 additions & 4 deletions
@@ -384,10 +384,10 @@ def _build_training(self):
             optimizer = self.run_opt._HVD.DistributedOptimizer(optimizer)
         else:
             optimizer = tf.train.AdamOptimizer(learning_rate = self.learning_rate)
-        grads = tf.gradients(self.l2_l, trainable_variables)
-        apply_op = optimizer.apply_gradients (zip (grads, trainable_variables),
-                                              global_step=self.global_step,
-                                              name='train_step')
+        apply_op = optimizer.minimize(loss=self.l2_l,
+                                      global_step=self.global_step,
+                                      var_list=trainable_variables,
+                                      name='train_step')
         train_ops = [apply_op] + self._extra_train_ops
         self.train_op = tf.group(*train_ops)
         log.info("built training")
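
Why this change restores gradient averaging: Horovod's `DistributedOptimizer` performs the cross-rank allreduce inside its `compute_gradients()` method. The old code computed gradients directly with `tf.gradients()` and only passed them to the wrapped optimizer's `apply_gradients()`, so in parallel training each rank applied its own local gradients and they were never averaged. `optimizer.minimize()` routes gradient computation through the wrapper, so the averaging happens before the update. A minimal sketch of the two patterns, assuming TF1-style graph mode and `horovod.tensorflow`; the toy variable and loss below are illustrative, not taken from `trainer.py`:

```python
import tensorflow.compat.v1 as tf
import horovod.tensorflow as hvd

tf.disable_eager_execution()
hvd.init()

# Toy model: one trainable variable and a quadratic loss.
w = tf.Variable(1.0, name="w")
loss = tf.reduce_sum(tf.square(w - 3.0))
variables = tf.trainable_variables()
global_step = tf.train.get_or_create_global_step()

opt = tf.train.AdamOptimizer(learning_rate=1e-3)
opt = hvd.DistributedOptimizer(opt)  # the allreduce lives in compute_gradients()

# Buggy pattern: tf.gradients() bypasses the wrapper, so each rank applies
# only its local gradients and nothing is averaged across ranks.
#   grads = tf.gradients(loss, variables)
#   train_op = opt.apply_gradients(zip(grads, variables), global_step=global_step)

# Fixed pattern: minimize() calls the wrapped compute_gradients(), which
# averages gradients across ranks before apply_gradients() runs.
train_op = opt.minimize(loss, global_step=global_step, var_list=variables)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(hvd.broadcast_global_variables(0))  # keep all ranks in sync at start
    sess.run(train_op)
```

The explicit form `opt.compute_gradients(loss, var_list=variables)` followed by `opt.apply_gradients(...)` would also keep the averaging, since it is the wrapped `compute_gradients()` that performs the allreduce; only the direct `tf.gradients()` call skips it.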

doc/train/parallel-training.md

Lines changed: 10 additions & 4 deletions
@@ -5,13 +5,19 @@ Currently, parallel training is enabled in a sychoronized way with help of [Horo
 Testing `examples/water/se_e2_a` on a 8-GPU host, linear acceleration can be observed with increasing number of cards.
 | Num of GPU cards | Seconds every 100 samples | Samples per second | Speed up |
 | -- | -- | -- | -- |
-| 1 | 1.6116 | 62.05 | 1.00 |
-| 2 | 1.6310 | 61.31 | 1.98 |
-| 4 | 1.6168 | 61.85 | 3.99 |
-| 8 | 1.6212 | 61.68 | 7.95 |
+| 1 | 1.4515 | 68.89 | 1.00 |
+| 2 | 1.5962 | 62.65*2 | 1.82 |
+| 4 | 1.7635 | 56.71*4 | 3.29 |
+| 8 | 1.7267 | 57.91*8 | 6.72 |
 
 To experience this powerful feature, please intall Horovod and [mpi4py](https://github.com/mpi4py/mpi4py) first. For better performance on GPU, please follow tuning steps in [Horovod on GPU](https://github.com/horovod/horovod/blob/master/docs/gpus.rst).
 ```bash
+# With GPU, prefer NCCL as communicator.
+HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_TENSORFLOW=1 HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/path/to/nccl pip3 install horovod mpi4py
+```
+
+If your work in CPU environment, please prepare runtime as below:
+```bash
 # By default, MPI is used as communicator.
 HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_TENSORFLOW=1 pip install horovod mpi4py
 ```
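
A note on reading the corrected table: "Seconds every 100 samples" is the per-worker timing, so per-card throughput is 100 divided by that time, the "Samples per second" column reports that value times the number of cards, and "Speed up" is the aggregate throughput relative to the single-card run. A quick sanity check in plain Python, using only the values from the new table:

```python
# Recompute the corrected throughput columns from the per-card timings.
timings = {1: 1.4515, 2: 1.5962, 4: 1.7635, 8: 1.7267}  # seconds per 100 samples, per card

baseline = 100.0 / timings[1]  # single-card throughput, ~68.89 samples/s
for cards, seconds in timings.items():
    per_card = 100.0 / seconds    # samples/s on each card
    aggregate = per_card * cards  # total samples/s across all cards
    speedup = aggregate / baseline
    print(f"{cards} card(s): {per_card:.2f}*{cards} samples/s, speed up {speedup:.2f}")
# Prints 68.89 / 62.65*2 / 56.71*4 / 57.91*8 with speed ups 1.00 / 1.82 / 3.29 / 6.72,
# matching the updated rows.
```

Writing the multiplier out makes it explicit that the speed-up is measured against aggregate throughput, and that the corrected measurements scale close to, but not exactly, linearly on this host.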

0 commit comments
