Seed expanded LoRA dimensions with noise instead of zeros #724
Open
Stefatorus wants to merge 1 commit into ostris:main
Conversation
When expanding a pretrained LoRA to a higher rank, the new dimensions were zero-initialized on both lora_down (A) and lora_up (B). This creates a dead zone where neither side can receive gradients: dL/dB[:,i] depends on A[i,:] (zero) and dL/dA[i,:] depends on B[:,i] (zero), so the new dimensions can never learn. Seed new dimensions with small noise scaled relative to existing weights: kaiming-style for lora_down, small random for lora_up. This preserves the original ΔW (perturbation <1%) while ensuring gradients flow through all new dimensions from step 1. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
The expanded dimensions of lora_down and lora_up were zero-initialized with torch.zeros(). This creates a gradient dead zone: dL/dB[:,i] depends on A[i,:] being nonzero, and dL/dA[i,:] depends on B[:,i] being nonzero. When both sides are zero, neither can ever receive a gradient — the new dimensions are permanently dead. The fix seeds the new dimensions with small noise (Kaiming-style for lora_down, small random for lora_up) scaled relative to the existing learned weights. The perturbation to the original ΔW is <1%, but all new dimensions now have nonzero gradients from step 1.

Problem
Expanding a rank-16 LoRA to rank 64 with the Prodigy optimizer, we observed via SVD analysis that the effective rank at 95% energy stayed flat at ~10.5 across hundreds of training steps — none of the 48 new dimensions were ever utilized. The root cause is the zero initialization, which creates a fixed point in the loss landscape that gradient descent cannot escape.
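The dead zone described above is easy to reproduce in a toy setting. The sketch below (not code from this PR) builds a rank-4 LoRA pair whose last two dimensions are zero-padded, as the old expansion code did, and shows that those dimensions receive exactly zero gradient:

```python
import torch

# Toy LoRA pair expanded from rank 2 to rank 4; the new rows of A
# (lora_down) and new columns of B (lora_up) are zero-padded.
in_dim, out_dim, old_rank, new_rank = 8, 8, 2, 4

A = torch.zeros(new_rank, in_dim)   # lora_down
B = torch.zeros(out_dim, new_rank)  # lora_up
A[:old_rank] = torch.randn(old_rank, in_dim)
B[:, :old_rank] = torch.randn(out_dim, old_rank)
A.requires_grad_()
B.requires_grad_()

x = torch.randn(4, in_dim)
loss = (x @ A.T @ B.T).pow(2).sum()
loss.backward()

# dL/dA[i,:] is proportional to B[:,i] (zero) and dL/dB[:,i] to the
# hidden activation x @ A[i,:] (zero), so the padded dims stay dead:
print(A.grad[old_rank:].abs().max())   # zero gradient on new rows of A
print(B.grad[:, old_rank:].abs().max())  # zero gradient on new cols of B
print(A.grad[:old_rank].abs().max())   # old rows still train normally
```

No matter how many steps are taken, the zero-padded dimensions never move, which matches the flat effective-rank curve observed above.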
Solution
Replace the torch.zeros() padding with small noise:

- lora_down (A): Kaiming-style, std = (weight_scale * 0.1) / sqrt(fan_in)
- lora_up (B): small random, ~1% of typical column magnitude

This matches the asymmetry of standard LoRA initialization (larger A, smaller B) while keeping the perturbation small enough to preserve the pretrained signal.
Test plan
🤖 Generated with Claude Code
I was working on training a LoRA on Flux 2 Dev, initialized at rank 16, and wanted to expand the rank later to see if I could escape a plateau.
AI-Toolkit does allow graceful expansion of LoRA rank; unfortunately, it seeds the new dimensions with 0, which blocks gradients from passing through them.
I solved this on my end by manually seeding the converted LoRA with noise, since the new dimensions were going unused, and asked Claude Code to commit the fix.