
Conversation

@mkhona-nvidia
Contributor

Builds the helper functions for the PSGD Kron optimizer by Xi-Lin Li.

Includes a subspace-iteration-based lower-bound calculation for the spectral norm, an online Procrustes solver, and functions to apply the Kronecker-factored preconditioner with upper-triangular factors.
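For context on the spectral-norm piece, here is a minimal sketch of a subspace/power-iteration lower bound, assuming a standalone PyTorch helper; the name `spectral_norm_lower_bound`, its signature, and the iteration count are illustrative and not the actual functions added in this PR:

```python
import torch

def spectral_norm_lower_bound(A: torch.Tensor, num_iters: int = 4) -> torch.Tensor:
    """Return a lower bound on ||A||_2 for a 2-D tensor A via power iteration.

    For any unit vector v, ||A v|| <= ||A||_2, so the estimate below is always
    a valid lower bound; a few iterations on A^T A tighten it toward ||A||_2.
    """
    v = torch.randn(A.shape[1], dtype=A.dtype, device=A.device)
    v = v / v.norm()
    for _ in range(num_iters):
        # One power-iteration step on A^T A, renormalizing to avoid overflow.
        v = A.T @ (A @ v)
        v = v / v.norm().clamp_min(torch.finfo(A.dtype).tiny)
    return (A @ v).norm()  # ||A v|| with ||v|| = 1 lower-bounds ||A||_2
```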

@mkhona-nvidia mkhona-nvidia requested a review from skyw October 3, 2025 17:29
@copy-pr-bot

copy-pr-bot bot commented Oct 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mkhona-nvidia mkhona-nvidia changed the title [DRAFT] PSGD-Kron's helper functions PSGD-Kron's helper functions Oct 3, 2025
@mkhona-nvidia mkhona-nvidia self-assigned this Oct 6, 2025
@mkhona-nvidia mkhona-nvidia requested a review from a team as a code owner October 7, 2025 22:50
@evanatyourservice

@mkhona-nvidia This is really awesome work! The functions are looking accurate.

I'll mention these for down the road:

  • LR: PSGD's default LR should be about 1/5 of Adam's (e.g., if we consider Adam's default to be 0.001, PSGD's default is then 0.0002).
  • momentum behavior: in Xi-Lin's code, I believe the default is no momentum and not passing momentum into the preconditioner update. For modern deep learning, be sure the default is to use momentum, pass it into the preconditioner update, and have it be what is whitened for the final update. To spell it out, the pseudocode flow is (see the sketch after this list):
    1) do momentum EMA
    2) Pg = apply_qs(Qs, momentum_with_damping_noise)
    3) do Q update
    4) Pg_out = apply_qs(Qs_new, clean_momentum)
  • betas: I've found a momentum beta of 0.95 and a betaL of 0.95 to work very well, but Xi-Lin's defaults are 0.9 for both, so you can probably choose whichever you'd like.
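
For readers following this thread, here is a minimal sketch of that four-step flow, assuming hypothetical helper callables `apply_qs` (apply the Kronecker-factored preconditioner) and `update_qs` (the Q/preconditioner update); the function name, signature, and defaults below are illustrative placeholders, not this PR's actual API:

```python
import torch

def psgd_step(grad, momentum_buf, Qs, apply_qs, update_qs,
              beta=0.95, beta_l=0.95, damping=1e-9):
    """One sketched optimizer step following the suggested flow.

    `apply_qs` and `update_qs` are placeholder callables standing in for the
    PR's helpers; only the ordering of the four steps is the point here.
    """
    # 1) momentum EMA on the raw gradient
    momentum_buf.mul_(beta).add_(grad, alpha=1.0 - beta)

    # 2) precondition a lightly damped/noised copy of the momentum for the Q update
    noisy = momentum_buf + damping * torch.randn_like(momentum_buf)
    Pg = apply_qs(Qs, noisy)

    # 3) update the Kronecker factors Qs (preconditioner update, beta_l as its EMA factor)
    Qs = update_qs(Qs, noisy, Pg, beta_l=beta_l)

    # 4) whiten the clean momentum with the updated factors for the final update
    return apply_qs(Qs, momentum_buf), Qs
```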

@mkhona-nvidia
Contributor Author

/ok to test b066ed1

@mkhona-nvidia
Contributor Author

(Quoting @evanatyourservice's suggestions above on the default LR, momentum flow, and betas.)

Sounds good. Once this PR is merged, I will open a follow-up PR that uses the building blocks from this PR to implement PSGD-Kron-Pro with this flow and these defaults.

@mkhona-nvidia
Contributor Author

/ok to test 2062d44

@mkhona-nvidia
Contributor Author

/ok to test 0788dc9

@mkhona-nvidia
Contributor Author

/ok to test 2fe3ce3

skyw previously approved these changes Oct 8, 2025
@mkhona-nvidia
Contributor Author

/ok to test 5006bb9

@skyw skyw merged commit d1e462c into NVIDIA-NeMo:main Oct 8, 2025
12 checks passed
pablo-garay pushed a commit that referenced this pull request Oct 10, 2025
* cleaned up lower bound function for spectral norm based on Xi-lin's latest code

Signed-off-by: mikail <[email protected]>
Signed-off-by: Pablo Garay <[email protected]>