doc/fluid/design/algorithm/parameter_average.md
# Averaging Parameter in PaddlePaddle
## Why Averaging
In a large scale machine learning setup where the size of the training data is huge, it could take us a large number of iterations over the training data before we can achieve the optimal values of parameters of our model. Looking at the problem setup, it is desirable to obtain the optimal values of parameters by going through the data in as few passes as possible.
Polyak and Juditsky (1992) showed that the test performance of a simple average of parameters obtained by Stochastic Gradient Descent (SGD) is as good as that of parameter values obtained by repeatedly training the model over the training dataset.
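The effect Polyak and Juditsky describe is easy to reproduce. The sketch below is illustrative only (plain NumPy, not PaddlePaddle): it runs SGD on a noisy one-dimensional quadratic while keeping an incremental mean of the iterates, and the averaged iterate typically lands much closer to the optimum than the last raw iterate, whose noise does not decay.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([5.0])      # current SGD iterate
theta_avg = theta.copy()     # running average of all iterates so far

for k in range(1, 2001):
    # noisy gradient of f(theta) = 0.5 * theta^2 (true gradient is theta)
    grad = theta + rng.normal(scale=1.0, size=1)
    theta = theta - 0.05 * grad
    # incremental mean: avg_k = avg_{k-1} + (theta_k - avg_{k-1}) / k
    theta_avg += (theta - theta_avg) / k

print(abs(theta_avg[0]), abs(theta[0]))
```

The incremental-mean update is the key trick: it tracks the exact average of all iterates while storing only one extra copy of the parameters.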
We propose averaging for any optimizer similar to how ASGD performs it, as mentioned above.
### How to perform Parameter Averaging in PaddlePaddle
Parameter Averaging in PaddlePaddle works in the following way during training:
1. It will take in an instance of an optimizer as an input, e.g. RMSPropOptimizer
2. The optimizer itself is responsible for updating the parameters.
3. The ParameterAverageOptimizer maintains a separate copy of the parameters for itself:
    1. In theory, the values of this copy are the average of the values of the parameters in the most recent N batches.
    2. However, saving all N instances of the parameters in memory is not feasible.
    3. Therefore, an approximation algorithm is used.
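One way to approximate the window average without storing N copies is to fold each new parameter value into a single running estimate. The sketch below is a hypothetical stand-in, not PaddlePaddle's actual approximation: it uses an exponential moving average with decay roughly `1 - 1/N`, and the class name `ParameterAverager` is invented for illustration.

```python
class ParameterAverager:
    """Keeps one extra copy of the parameters that tracks a moving
    average, as a stand-in for the window average over the most
    recent N batches (decay ~ 1 - 1/N)."""

    def __init__(self, params, window=100):
        self.decay = 1.0 - 1.0 / window
        self.shadow = {name: float(v) for name, v in params.items()}

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current value
        for name, v in params.items():
            self.shadow[name] += (1.0 - self.decay) * (v - self.shadow[name])

# usage: after each training step the optimizer updates `params`,
# then we fold the new values into the averaged copy
params = {"w": 0.0}
avg = ParameterAverager(params, window=10)
for step in range(100):
    params["w"] = 1.0            # pretend the optimizer settled at 1.0
    avg.update(params)
print(avg.shadow["w"])           # approaches 1.0
```

This trades exactness for O(1) memory: only the shadow copy is stored, regardless of the window size N.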
Hence, overall we have two copies of the parameters: one for the optimizer itself, and one for the ParameterAverageOptimizer. The former should be used in back propagation, while the latter should be used during testing and should be saved.
During the testing/saving the model phase, we perform the following steps:
1. Perform the delayed operations.
2. Save current values of the parameters to a temporary variable.
3. Replace the values of the parameters with the averaged values.
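The save-swap-restore steps above follow a common pattern: expose the averaged values for testing or saving, then put the optimizer's own values back. A minimal Python sketch (the `averaged_params` helper is hypothetical, not a Paddle API):

```python
from contextlib import contextmanager

@contextmanager
def averaged_params(params, averaged):
    """Temporarily swap averaged values into `params` for
    testing/saving, then restore the optimizer's own values."""
    backup = dict(params)        # save current values to a temporary
    params.update(averaged)      # replace with the averaged values
    try:
        yield params             # test / save the model here
    finally:
        params.update(backup)    # restore for further training

params = {"w": 0.3}              # optimizer's own copy
averaged = {"w": 0.25}           # ParameterAverageOptimizer's copy
with averaged_params(params, averaged) as p:
    assert p["w"] == 0.25        # averaged values visible during testing
assert params["w"] == 0.3        # optimizer's values restored afterwards
```

Using a context manager guarantees the restore step runs even if evaluation raises, so training always resumes from the optimizer's own parameter values.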