Commit b4b4c11

Update README.md
1 parent 89b4cea commit b4b4c11


README.md

Lines changed: 5 additions & 1 deletion
@@ -1,5 +1,9 @@
 ## Pytorch implementation of PSGD
-[2025 updates] The [old PSGD implementation](https://github.com/lixilinx/psgd_torch/blob/master/preconditioned_stochastic_gradient_descent.py) is deprecated. The [new PSGD implementation](https://github.com/lixilinx/psgd_torch/blob/master/psgd.py) is a superset of the old one, and further supports four more matmul-only/inverse-free geometries for updating $Q$. The choices $dQ=Q^{0.5} \mathcal{E} Q^{1.5}$ (default) and $dP=P^{0.5} \mathcal{E} P$ relate to the Newton-Schulz iteration. When $Q$ is fitted with bfloat16 precision, it is recommended to set lr_preconditioner $\gg 0.01$, say $\ge 0.1$, to avoid round-off errors. For the KronWhiten class, it is recommended to whiten and clip the momentum if the gradients are sparse over time. This [torch.optim DDP wrapping](https://github.com/lixilinx/psgd_torch/blob/master/wrapped_as_torch_optimizer_for_ddp.py) example should work well out of the box for most problems.
+[2025 updates] The [old PSGD implementation](https://github.com/lixilinx/psgd_torch/blob/master/preconditioned_stochastic_gradient_descent.py) is deprecated. The [new PSGD implementation](https://github.com/lixilinx/psgd_torch/blob/master/psgd.py) is a superset of the old one, and further supports four more matmul-only/inverse-free geometries for updating $Q$. The choices $dQ=Q^{0.5} \mathcal{E} Q^{1.5}$ (default) and $dP=P^{0.5} \mathcal{E} P$ relate to the Newton-Schulz iteration. When $Q$ is fitted with bfloat16 precision, it is recommended to set lr_preconditioner $\gg 0.01$, say $\ge 0.1$, to avoid round-off errors.
+
+Two basic torch.optim.Optimizer wrapping examples are provided. They should work well out of the box for most ML problems using NLL losses.
+* [Single and multi-GPU DDP wrapping](https://github.com/lixilinx/psgd_torch/blob/master/wrapped_as_torch_optimizer_for_ddp.py). This example wraps a momentum-whitening optimizer using a Kron preconditioner fitted with $dQ=Q^{0.5} \mathcal{E} Q^{1.5}$, which is essentially an online Newton-Schulz iteration for the inverse fourth root of the momentum autocorrelation matrix.
+* [DTensor-based wrapping](https://github.com/lixilinx/psgd_torch/blob/master/wrapped_as_torch_optimizer_for_dtensor.py). Similar to the DDP wrapping, but it whitens each slice of the momentum independently, which makes it a good fit for DTensor-based distributed training, e.g., FSDP and TP.
 
 ### An overview
 PSGD (Preconditioned SGD) is a general-purpose (mathematical and stochastic, convex and nonconvex) 2nd-order optimizer. It reformulates a wide range of preconditioner estimation and Hessian fitting problems as a family of strongly convex Lie group optimization problems.
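The "inverse fourth root" remark in the DDP bullet can be made precise. A short derivation, assuming the usual PSGD factorization $P = Q^\top Q$ of the preconditioner and a symmetric positive definite $Q$ (so $P = Q^2$); $m$ denotes the momentum:

```latex
% Whitening condition: the preconditioned momentum Pm has identity
% autocorrelation, which pins down P and hence Q.
P \, E[m m^\top] \, P = I
\;\Longrightarrow\;
P = \left(E[m m^\top]\right)^{-1/2}
\;\Longrightarrow\;
Q = P^{1/2} = \left(E[m m^\top]\right)^{-1/4}.
```

So fitting $Q$ online with the $dQ=Q^{0.5} \mathcal{E} Q^{1.5}$ geometry drives $Q$ toward the inverse fourth root of $E[m m^\top]$, matching the bullet's description.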
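The connection to the Newton-Schulz iteration mentioned above can be illustrated concretely. A minimal sketch, not taken from this repo: the classical coupled Newton-Schulz iteration computes $A^{1/2}$ and $A^{-1/2}$ for a symmetric positive definite $A$ with $\|I - A\| < 1$; the function and the test matrix here are illustrative only, and plain-Python matrices keep the sketch dependency-free.

```python
# Illustrative sketch (not PSGD's actual code): coupled Newton-Schulz
# iteration. With Y0 = A, Z0 = I and T = (3I - Z Y)/2, the updates
# Y <- Y T, Z <- T Z converge to Y ~ A^{1/2} and Z ~ A^{-1/2}
# when the spectrum of A lies in (0, 2) -- matmuls only, no inverses.

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def newton_schulz_inv_sqrt(A, iters=30):
    n = len(A)
    I = [[float(i == j) for j in range(n)] for i in range(n)]
    Y, Z = [row[:] for row in A], I
    for _ in range(iters):
        ZY = matmul(Z, Y)
        T = [[(3.0 * I[i][j] - ZY[i][j]) / 2.0 for j in range(n)]
             for i in range(n)]
        Y, Z = matmul(Y, T), matmul(T, Z)
    return Y, Z  # approximately A^{1/2} and A^{-1/2}

# Hypothetical test matrix, SPD with eigenvalues ~0.78 and ~1.22.
A = [[1.2, 0.1], [0.1, 0.8]]
Y, Z = newton_schulz_inv_sqrt(A)
```

The iteration converges quadratically, which is why a matmul-only, inverse-free update of this kind is attractive for fitting preconditioners on accelerators.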
