- [2025 updates] The [old PSGD implementation](https://github.com/lixilinx/psgd_torch/blob/master/preconditioned_stochastic_gradient_descent.py) is deprecated. The [new PSGD implementation](https://github.com/lixilinx/psgd_torch/blob/master/psgd.py) is a superset of the old one and additionally supports four matmul-only/inverse-free geometries for updating $Q$. The choices $dQ=Q^{0.5} \mathcal{E} Q^{1.5}$ (default) and $dP=P^{0.5} \mathcal{E} P$ are related to the Newton-Schulz iteration. When $Q$ is fitted in bfloat16 precision, we recommend setting lr_preconditioner well above 0.01, say $\ge 0.1$, to avoid round-off errors. For the KronWhiten class, we recommend whitening and clipping the momentum when the gradients are sparse over time. This [torch.optim DDP wrapping](https://github.com/lixilinx/psgd_torch/blob/master/wrapped_as_torch_optimizer_for_ddp.py) example should work out of the box for most problems.
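To see what "matmul-only/inverse-free" means in practice, here is a minimal NumPy sketch of the classical Newton-Schulz iteration, which the $dQ$ and $dP$ choices above relate to. The function name and setup are illustrative only, not part of the psgd API: it approaches $A^{-1}$ using nothing but matrix multiplications, with no explicit inverse or decomposition.

```python
import numpy as np

# Illustrative sketch (not psgd code): the Newton-Schulz iteration
# X_{k+1} = X_k (2I - A X_k) converges quadratically to A^{-1}
# whenever ||I - A X_0|| < 1, using only matmuls.
def newton_schulz_inverse(A, iters=30):
    # X_0 = A^T / (||A||_1 ||A||_inf) guarantees ||I - A X_0|| < 1
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(A.shape[0])
    for _ in range(iters):
        X = X @ (2 * I - A @ X)  # matmul-only, inverse-free update
    return X

A = np.array([[4.0, 1.0], [2.0, 3.0]])
X = newton_schulz_inverse(A)
print(np.allclose(X @ A, np.eye(2), atol=1e-8))  # -> True
```

Because each step is pure matmul, iterations like this stay cheap and numerically friendly in low precision (e.g. bfloat16), which is the appeal of the inverse-free geometries for updating $Q$.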