Commit b4b4c11

Update README.md
1 parent 89b4cea commit b4b4c11


README.md

Lines changed: 5 additions & 1 deletion
@@ -1,5 +1,9 @@
 ## Pytorch implementation of PSGD
-[2025 updates] The [old PSGD implementation](https://github.com/lixilinx/psgd_torch/blob/master/preconditioned_stochastic_gradient_descent.py) is deprecated. The [new PSGD implementation](https://github.com/lixilinx/psgd_torch/blob/master/psgd.py) is a superset of the old one, and further supports four more matmul-only/inverse-free geometries for updating $Q$. The choices $dQ=Q^{0.5} \mathcal{E} Q^{1.5}$ (default) and $dP=P^{0.5} \mathcal{E} P$ relate to the Newton-Schulz iteration. When $Q$ is fitted with bfloat16 precision, it is recommended to set lr_preconditioner $\gg 0.01$, say $\ge 0.1$, to avoid round-off errors. For the KronWhiten class, it is recommended to whiten and clip the momentum if the gradients are sparse over time. This [torch.optim DDP wrapping](https://github.com/lixilinx/psgd_torch/blob/master/wrapped_as_torch_optimizer_for_ddp.py) example should work well out of the box for most problems.
+[2025 updates] The [old PSGD implementation](https://github.com/lixilinx/psgd_torch/blob/master/preconditioned_stochastic_gradient_descent.py) is deprecated. The [new PSGD implementation](https://github.com/lixilinx/psgd_torch/blob/master/psgd.py) is a superset of the old one, and further supports four more matmul-only/inverse-free geometries for updating $Q$. The choices $dQ=Q^{0.5} \mathcal{E} Q^{1.5}$ (default) and $dP=P^{0.5} \mathcal{E} P$ relate to the Newton-Schulz iteration. When $Q$ is fitted with bfloat16 precision, it is recommended to set lr_preconditioner $\gg 0.01$, say $\ge 0.1$, to avoid round-off errors.
+
+Two basic torch.optim.Optimizer wrapping examples are provided. They should work well out of the box for most ML problems using NLL losses.
+* [Single and multi-GPU DDP wrapping](https://github.com/lixilinx/psgd_torch/blob/master/wrapped_as_torch_optimizer_for_ddp.py). This example wraps a momentum-whitening optimizer using a Kron preconditioner fitted with $dQ=Q^{0.5} \mathcal{E} Q^{1.5}$, which is essentially an online Newton-Schulz iteration for the inverse fourth root of the momentum autocorrelation matrix.
+* [DTensor-based wrapping](https://github.com/lixilinx/psgd_torch/blob/master/wrapped_as_torch_optimizer_for_dtensor.py). Similar to the DDP wrapping, but it whitens each slice of the momentum independently, which makes it a good fit for DTensor-based distributed training, e.g., FSDP and TP.
 
 ### An overview
 PSGD (Preconditioned SGD) is a general-purpose (mathematical and stochastic, convex and nonconvex) 2nd-order optimizer. It reformulates a wide range of preconditioner estimation and Hessian fitting problems as a family of strongly convex Lie group optimization problems.
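The "inverse fourth root" remark in the DDP bullet can be made precise. A short derivation, assuming the usual PSGD factorization $P = Q^\top Q$ of the preconditioner and a symmetric positive definite $Q$ (so $P = Q^2$); $m$ denotes the momentum:

```latex
% Whitening condition: the preconditioned momentum Pm has identity
% autocorrelation, which pins down P and hence Q.
P \, E[m m^\top] \, P = I
\;\Longrightarrow\;
P = \left(E[m m^\top]\right)^{-1/2}
\;\Longrightarrow\;
Q = P^{1/2} = \left(E[m m^\top]\right)^{-1/4}.
```

So fitting $Q$ online with the $dQ=Q^{0.5} \mathcal{E} Q^{1.5}$ geometry drives $Q$ toward the inverse fourth root of $E[m m^\top]$, matching the bullet's description.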
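The connection to the Newton-Schulz iteration mentioned above can be illustrated concretely. A minimal sketch, not taken from this repo: the classical coupled Newton-Schulz iteration computes $A^{1/2}$ and $A^{-1/2}$ for a symmetric positive definite $A$ with $\|I - A\| < 1$; the function and the test matrix here are illustrative only, and plain-Python matrices keep the sketch dependency-free.

```python
# Illustrative sketch (not PSGD's actual code): coupled Newton-Schulz
# iteration. With Y0 = A, Z0 = I and T = (3I - Z Y)/2, the updates
# Y <- Y T, Z <- T Z converge to Y ~ A^{1/2} and Z ~ A^{-1/2}
# when the spectrum of A lies in (0, 2) -- matmuls only, no inverses.

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def newton_schulz_inv_sqrt(A, iters=30):
    n = len(A)
    I = [[float(i == j) for j in range(n)] for i in range(n)]
    Y, Z = [row[:] for row in A], I
    for _ in range(iters):
        ZY = matmul(Z, Y)
        T = [[(3.0 * I[i][j] - ZY[i][j]) / 2.0 for j in range(n)]
             for i in range(n)]
        Y, Z = matmul(Y, T), matmul(T, Z)
    return Y, Z  # approximately A^{1/2} and A^{-1/2}

# Hypothetical test matrix, SPD with eigenvalues ~0.78 and ~1.22.
A = [[1.2, 0.1], [0.1, 0.8]]
Y, Z = newton_schulz_inv_sqrt(A)
```

The iteration converges quadratically, which is why a matmul-only, inverse-free update of this kind is attractive for fitting preconditioners on accelerators.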
