
Conversation

@mkhona-nvidia
Contributor

Builds the helper functions for the PSGD Kron optimizer by Xi-Lin Li.

Includes a subspace-iteration-based lower-bound calculation for the spectral norm, an online Procrustes solver, and functions to apply the Kronecker-factored preconditioner with upper-triangular factors.
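For context on the spectral-norm piece, here is a minimal sketch of a subspace/power-iteration lower bound, assuming a standalone PyTorch helper; the name `spectral_norm_lower_bound`, its signature, and the iteration count are illustrative and not the actual functions added in this PR:

```python
import torch

def spectral_norm_lower_bound(A: torch.Tensor, num_iters: int = 4) -> torch.Tensor:
    """Return a lower bound on ||A||_2 for a 2-D tensor A via power iteration.

    For any unit vector v, ||A v|| <= ||A||_2, so the estimate below is always
    a valid lower bound; a few iterations on A^T A tighten it toward ||A||_2.
    """
    v = torch.randn(A.shape[1], dtype=A.dtype, device=A.device)
    v = v / v.norm()
    for _ in range(num_iters):
        # One power-iteration step on A^T A, renormalizing to avoid overflow.
        v = A.T @ (A @ v)
        v = v / v.norm().clamp_min(torch.finfo(A.dtype).tiny)
    return (A @ v).norm()  # ||A v|| with ||v|| = 1 lower-bounds ||A||_2
```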

@mkhona-nvidia mkhona-nvidia requested a review from skyw October 3, 2025 17:29
@copy-pr-bot

copy-pr-bot bot commented Oct 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mkhona-nvidia mkhona-nvidia changed the title [DRAFT] PSGD-Kron's helper functions PSGD-Kron's helper functions Oct 3, 2025
@mkhona-nvidia mkhona-nvidia self-assigned this Oct 6, 2025
@mkhona-nvidia mkhona-nvidia requested a review from a team as a code owner October 7, 2025 22:50
@evanatyourservice

@mkhona-nvidia This is really awesome work! The functions are looking accurate.

I'll mention these for down the road:

  • LR: PSGD's default LR should be about 1/5 of Adam's (e.g., if we consider Adam's default to be 0.001, PSGD's default is then 0.0002).
  • momentum behavior: in Xi-Lin's code, I believe the default is no momentum and not passing momentum into the preconditioner update. For modern deep learning, be sure the default is to use momentum, pass it into the preconditioner update, and have it be what is whitened for the final update. To spell it out, the pseudocode flow is (see the sketch after this list):
    1) do momentum EMA
    2) Pg = apply_qs(Qs, momentum_with_damping_noise)
    3) do Q update
    4) Pg_out = apply_qs(Qs_new, clean_momentum)
  • betas: I've found a momentum beta of 0.95 and a betaL of 0.95 to work very well, but Xi-Lin's defaults are 0.9 for both, so you can probably choose whichever you'd like.
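
For readers following this thread, here is a minimal sketch of that four-step flow, assuming hypothetical helper callables `apply_qs` (apply the Kronecker-factored preconditioner) and `update_qs` (the Q/preconditioner update); the function name, signature, and defaults below are illustrative placeholders, not this PR's actual API:

```python
import torch

def psgd_step(grad, momentum_buf, Qs, apply_qs, update_qs,
              beta=0.95, beta_l=0.95, damping=1e-9):
    """One sketched optimizer step following the suggested flow.

    `apply_qs` and `update_qs` are placeholder callables standing in for the
    PR's helpers; only the ordering of the four steps is the point here.
    """
    # 1) momentum EMA on the raw gradient
    momentum_buf.mul_(beta).add_(grad, alpha=1.0 - beta)

    # 2) precondition a lightly damped/noised copy of the momentum for the Q update
    noisy = momentum_buf + damping * torch.randn_like(momentum_buf)
    Pg = apply_qs(Qs, noisy)

    # 3) update the Kronecker factors Qs (preconditioner update, beta_l as its EMA factor)
    Qs = update_qs(Qs, noisy, Pg, beta_l=beta_l)

    # 4) whiten the clean momentum with the updated factors for the final update
    return apply_qs(Qs, momentum_buf), Qs
```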

@mkhona-nvidia
Contributor Author

/ok to test b066ed1

@mkhona-nvidia
Contributor Author

(Quoting @evanatyourservice's suggestions above on the default LR, momentum flow, and betas.)

Sounds good. Once this PR is merged, I will open a follow-up PR that uses the building blocks from this PR to implement PSGD-Kron-Pro with this flow and these defaults.

@mkhona-nvidia
Contributor Author

/ok to test 2062d44

@mkhona-nvidia
Contributor Author

/ok to test 0788dc9

@mkhona-nvidia
Contributor Author

/ok to test 2fe3ce3

skyw previously approved these changes Oct 8, 2025
@mkhona-nvidia
Contributor Author

/ok to test 5006bb9

@skyw skyw merged commit d1e462c into NVIDIA-NeMo:main Oct 8, 2025
12 checks passed
pablo-garay pushed a commit that referenced this pull request Oct 10, 2025
* cleaned up lower bound function for spectral norm based on Xi-lin's latest code

Signed-off-by: mikail <[email protected]>
Signed-off-by: Pablo Garay <[email protected]>