python/paddle/fluid/optimizer.py (+7 -7: 7 additions, 7 deletions)
@@ -628,16 +628,16 @@ class DGCMomentumOptimizer(MomentumOptimizer):
 
     Original paper is https://arxiv.org/abs/1712.01887
 
-    DGC reduce the communication bandwidth by sending only the important gradients (sparse update):\
+    DGC reduces the communication bandwidth by sending only the important gradients (sparse update):\
         only gradients larger than a threshold are transmitted.
 
-    To avoid losing information, DGC accumulate the rest of the gradients locally.
+    To avoid losing information, DGC accumulates the rest of the gradients locally.
 
     Eventually, these gradients become large enough to be transmitted.
 
-    Thus, DGC send the large gradients immediately but eventually send all of the gradients over time.
+    Thus, DGC sends the large gradients immediately but eventually sends all of the gradients over time.
 
-    To ensure no loss of accuracy, DGC employs momentum correc-tionandlocal gradient clipping on top of the gradient sparsification to maintain model performance.
+    To ensure no loss of accuracy, DGC employs momentum correction and local gradient clipping on top of the gradient sparsification to maintain model performance.
 
     DGC also uses momentum factor masking and warmup training to overcome the staleness problem caused by reduced communication.
 
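The paragraph above summarizes the DGC mechanism: transmit only the largest gradient entries and accumulate the rest locally until they grow large enough to send. The following is an illustrative-only sketch of that idea (not PaddlePaddle's implementation); the function name `dgc_step` and the NumPy formulation are assumptions made for clarity.

```python
import numpy as np

def dgc_step(grad, residual, sparsity):
    """One DGC communication step: send the top (1 - sparsity) fraction of
    accumulated gradient entries by magnitude, keep the rest locally."""
    accumulated = residual + grad                         # local accumulation of unsent gradients
    k = max(1, int(accumulated.size * (1.0 - sparsity)))  # number of entries to transmit
    threshold = np.sort(np.abs(accumulated), axis=None)[-k]
    mask = np.abs(accumulated) >= threshold               # entries large enough to transmit now
    sent = np.where(mask, accumulated, 0.0)               # sparse update sent over the network
    new_residual = np.where(mask, 0.0, accumulated)       # kept locally, sent in a later step
    return sent, new_residual
```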
@@ -652,17 +652,17 @@ class DGCMomentumOptimizer(MomentumOptimizer):
     learning_rate (float|Variable): the learning rate used to update parameters. \
         Can be a float value or a Variable with one float value as data element.
     momentum (float): Momentum factor.
-    rampup_begin_step (int): The begining step from which gradient compression is implemented.
+    rampup_begin_step (int): The beginning step from which gradient compression is implemented.
     rampup_step (int): How long it use the sparsity periods. Default is 1.
         for example: If the sparsity is [0.75, 0.9375, 0.984375, 0.996, 0.999], and the rampup_step is 5, \
         it will use 0.75 at 0 step, and 0.9375 at 1 step, and so on. And when reach sparsity array ends, \
         it will use 0.999 then and after.
     sparsity (list[float]): Get top important element from gradient tensor, the ratio is (1 - current sparsity).
     use_nesterov (bool): Enables Nesterov momentum. True means use nesterov.
     local_grad_clip_norm (float): Clip norm value if needed.
-    num_trainers: The number of training node.
+    num_trainers: The number of training nodes.
     regularization: A Regularizer, such as fluid.regularizer.L2DecayRegularizer.
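The arguments listed above map directly onto the optimizer's constructor. A minimal usage sketch follows; the concrete values (e.g. `rampup_begin_step=1252`, the L2 decay coefficient) and the loss Variable `avg_cost` are illustrative assumptions, not values taken from this diff.

```python
import paddle.fluid as fluid

# ... build the network and obtain a loss Variable `avg_cost` first ...
optimizer = fluid.optimizer.DGCMomentumOptimizer(
    learning_rate=0.001,
    momentum=0.9,
    rampup_begin_step=1252,   # start sparsifying gradients after this step
    rampup_step=5,            # walk through the sparsity list over 5 steps
    sparsity=[0.75, 0.9375, 0.984375, 0.996, 0.999],
    use_nesterov=False,
    local_grad_clip_norm=None,
    num_trainers=None,
    regularization=fluid.regularizer.L2DecayRegularizer(regularization_coeff=1e-4),
)
optimizer.minimize(avg_cost)
```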