Commit ae39b69

docs: ScalableShampoo docstring
1 parent 3c77763 commit ae39b69

File tree: 1 file changed, +8 -4 lines

pytorch_optimizer/optimizer/shampoo.py

Lines changed: 8 additions & 4 deletions
@@ -151,8 +151,12 @@ class ScalableShampoo(Optimizer, BaseOptimizer):
     :param inverse_exponent_override: int. fixed exponent for pre-conditioner, if > 0.
     :param start_preconditioning_step: int.
     :param preconditioning_compute_steps: int. performance tuning params for controlling memory and compute
-        requirements. How often to compute pre-conditioner.
-    :param statistics_compute_steps: int. How often to compute statistics.
+        requirements. How often to compute the pre-conditioner. Ideally, 1 is best. However, the current
+        implementation doesn't work in a distributed environment (statistics & pre-conditioners are not synced
+        across replicas), computes on the GPU (not the CPU), and uses fp32 precision (not fp64).
+        Also, according to the paper, `preconditioning_compute_steps` does not have a significant effect on
+        performance. So, if speed is a problem, try setting this step larger (e.g. 1000).
+    :param statistics_compute_steps: int. How often to compute statistics. Usually set to 1 (or 10).
     :param block_size: int. Block size for large layers (if > 0).
         Block size = 1 ==> Adagrad (Don't do this, extremely inefficient!)
         Block size should be as large as feasible under memory/time constraints.
@@ -166,8 +170,8 @@ class ScalableShampoo(Optimizer, BaseOptimizer):
     :param diagonal_eps: float. term added to the denominator to improve numerical stability.
     :param matrix_eps: float. term added to the denominator to improve numerical stability.
     :param use_svd: bool. use SVD instead of Schur-Newton method to calculate M^{-1/p}.
-        Theoretically, Schur-Newton method is faster than SVD method to calculate M^{-1/p}.
-        However, the inefficiency of the loop code, SVD is much faster than that.
+        Theoretically, the Schur-Newton method is faster than SVD. However, due to the inefficient loop code
+        and the well-optimized SVD kernel, SVD is much faster in some cases (usually for small models).
         see https://github.com/kozistr/pytorch_optimizer/pull/103
     """
 
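For context, here is a minimal usage sketch of the parameters discussed in this diff. The argument names come from the docstring above; the learning rate and the concrete values are illustrative assumptions, so check the repository for the authoritative signature and defaults.

import torch

from pytorch_optimizer import ScalableShampoo

model = torch.nn.Linear(128, 10)

# Illustrative settings following the docstring's advice: recompute the
# pre-conditioner rarely (for speed), but gather statistics frequently.
optimizer = ScalableShampoo(
    model.parameters(),
    lr=1e-3,                             # assumed value; not discussed in this diff
    start_preconditioning_step=25,
    preconditioning_compute_steps=1000,  # set bigger if speed is a problem
    statistics_compute_steps=1,          # usually 1 (or 10)
    block_size=256,                      # as large as memory/time constraints allow
    use_svd=False,                       # Schur-Newton; True can be faster for small models
)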
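To clarify what `use_svd` toggles, the sketch below computes M^{-1/p} for a symmetric positive semi-definite statistics matrix via an eigendecomposition, which coincides with the SVD for such matrices. This is an illustrative stand-in, not the library's actual kernel; the helper name and the eps handling are assumptions.

import torch

def inverse_pth_root_svd(m: torch.Tensor, p: int, eps: float = 1e-6) -> torch.Tensor:
    # For a symmetric PSD matrix, the SVD coincides with the eigendecomposition,
    # so M^{-1/p} = Q diag(lambda^{-1/p}) Q^T. Illustrative sketch only.
    eigvals, eigvecs = torch.linalg.eigh(m)
    # Clamp tiny or negative eigenvalues for numerical stability (cf. matrix_eps).
    inv_root = eigvals.clamp(min=eps) ** (-1.0 / p)
    return eigvecs @ torch.diag(inv_root) @ eigvecs.t()

# Example: p = 4, the exponent Shampoo uses for the two pre-conditioners of a 2-D weight.
grad = torch.randn(8, 8)
stats = grad @ grad.t() + 1e-3 * torch.eye(8)  # symmetric PSD statistics matrix
pre = inverse_pth_root_svd(stats, p=4)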