
Seed expanded LoRA dimensions with noise instead of zeros#724

Open
Stefatorus wants to merge 1 commit into ostris:main from Stefatorus:fix/lora-expansion-noise-seeding

Conversation


@Stefatorus commented Feb 23, 2026

Summary

  • When expanding a pretrained LoRA to a higher rank (e.g. 16 → 64), new dimensions on both lora_down and lora_up were zero-initialized with torch.zeros(). This creates a gradient dead zone: dL/dB[:,i] depends on A[i,:] being nonzero, and dL/dA[i,:] depends on B[:,i] being nonzero. When both sides are zero, neither can ever receive a gradient — the new dimensions are permanently dead.
  • This fix seeds new dimensions with small noise (kaiming-style for lora_down, small random for lora_up) scaled relative to the existing learned weights. The perturbation to the original ΔW is <1%, but all new dimensions now have nonzero gradients from step 1.
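
The dead zone described in the first bullet can be demonstrated directly. A minimal sketch with hypothetical shapes, using plain PyTorch tensors rather than the toolkit's LoRA modules:

```python
import torch

torch.manual_seed(0)
fan_in, fan_out, old_rank, new_rank = 8, 8, 2, 4

A = torch.randn(new_rank, fan_in)   # lora_down
B = torch.randn(fan_out, new_rank)  # lora_up
A[old_rank:] = 0.0                  # zero-pad the new rows of A
B[:, old_rank:] = 0.0               # zero-pad the new columns of B
A.requires_grad_(True)
B.requires_grad_(True)

x = torch.randn(fan_in)
loss = (B @ (A @ x)).sum()
loss.backward()

# dL/dA[i,:] is proportional to B[:,i], and dL/dB[:,i] to (A @ x)[i],
# so every gradient entry touching the new dimensions is exactly zero.
print(A.grad[old_rank:].abs().max())     # tensor(0.)
print(B.grad[:, old_rank:].abs().max())  # tensor(0.)
```

Since optimizer updates are driven by these gradients (and the weights themselves are zero, so weight decay changes nothing either), the new dimensions stay at this fixed point indefinitely.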

Problem

Expanding a rank-16 LoRA to rank-64 with Prodigy optimizer, we observed via SVD analysis that effective rank @ 95% energy stayed flat at ~10.5 across hundreds of training steps — none of the 48 new dimensions were ever utilized. The root cause is the zero-initialization creating a fixed point in the loss landscape that gradient descent cannot escape.
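
The "effective rank @ 95% energy" metric can be computed from the singular values of ΔW. A small sketch; the exact energy definition (cumulative squared singular values) is an assumption about the analysis above:

```python
import torch

def effective_rank_95(lora_down: torch.Tensor, lora_up: torch.Tensor) -> int:
    """Smallest number of singular values of ΔW = lora_up @ lora_down
    that hold at least 95% of the squared-singular-value energy."""
    s = torch.linalg.svdvals(lora_up @ lora_down)
    energy = (s ** 2).cumsum(0) / (s ** 2).sum()
    return int((energy < 0.95).sum().item()) + 1
```

Because zero-padding leaves ΔW bit-for-bit identical, this metric cannot increase at expansion time; the flat ~10.5 curve shows it also never increased during subsequent training.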

[Figures: rank utilization, rank distribution, and singular-value spectra]

Solution

Replace torch.zeros() padding with small noise:

  • lora_down (A): Kaiming-style noise with std = (weight_scale * 0.1) / sqrt(fan_in)
  • lora_up (B): Small random noise at 1% of typical column magnitude

This matches the asymmetry of standard LoRA initialization (larger A, smaller B) while keeping the perturbation small enough to preserve the pretrained signal.
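
A sketch of the padding scheme; the function name and the exact scale heuristics (mean absolute weight for A, mean column norm for B) are illustrative assumptions, not the PR's literal code:

```python
import math
import torch

def expand_with_noise(lora_down: torch.Tensor, lora_up: torch.Tensor,
                      new_rank: int):
    """Pad a LoRA pair to new_rank, seeding new dims with small noise."""
    old_rank, fan_in = lora_down.shape
    out_dim = lora_up.shape[0]
    extra = new_rank - old_rank

    # lora_down (A): kaiming-style noise, std = (weight_scale * 0.1) / sqrt(fan_in)
    weight_scale = lora_down.abs().mean()
    std_a = weight_scale * 0.1 / math.sqrt(fan_in)
    new_a = torch.randn(extra, fan_in) * std_a

    # lora_up (B): noise whose column norm is ~1% of the typical existing column
    col_mag = lora_up.norm(dim=0).mean()
    new_b = torch.randn(out_dim, extra) * (0.01 * col_mag / math.sqrt(out_dim))

    down = torch.cat([lora_down, new_a], dim=0)
    up = torch.cat([lora_up, new_b], dim=1)
    return down, up
```

The original rows of A and columns of B are concatenated unchanged, so the pretrained dimensions are preserved exactly; only the product of the two small noise blocks perturbs ΔW.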

Test plan

  • Verified original dimensions are preserved exactly (max diff = 0)
  • Verified new dimensions are nonzero on both lora_down and lora_up
  • Verified gradients would flow through all new dimensions
  • Verified perturbation to ΔW is small (~5% with synthetic weights, lower with real trained weights)
  • Integration test: expand a trained LoRA and verify effective rank increases during continued training
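
The ΔW-perturbation check in the fourth bullet can be reproduced standalone. The weight scales here are synthetic, so the exact percentage differs from real trained weights:

```python
import torch

torch.manual_seed(0)
ld, lu = torch.randn(16, 128) * 0.02, torch.randn(256, 16) * 0.02
delta_old = lu @ ld

# Pad both factors with noise at ~1% of the existing weight scale
# (a simplified stand-in for the kaiming/column-magnitude heuristics).
ld_new = torch.cat([ld, torch.randn(48, 128) * 0.01 * ld.abs().mean()], dim=0)
lu_new = torch.cat([lu, torch.randn(256, 48) * 0.01 * lu.abs().mean()], dim=1)
delta_new = lu_new @ ld_new

rel = (delta_new - delta_old).norm() / delta_old.norm()
print(f"relative perturbation: {rel.item():.4%}")  # well below 1% at these scales
```

Only the product of the two noise blocks enters the difference, so the relative perturbation is second-order in the noise scale.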

🤖 Generated with Claude Code


I was working on training a LoRA on Flux 2 Dev, initialized at rank 16, and wanted to expand the rank later to see if I could escape a plateau.

AI-Toolkit does allow graceful expansion of LoRA rank; unfortunately, it seeds the new dimensions with zeros, which blocks gradients from passing through them.

I solved this on my end by manually seeding the converted LoRA's new dimensions with noise, since they were otherwise unused, and asked Claude Code to commit the fix.

When expanding a pretrained LoRA to a higher rank, the new dimensions
were zero-initialized on both lora_down (A) and lora_up (B). This
creates a dead zone where neither side can receive gradients:
dL/dB[:,i] depends on A[i,:] (zero) and dL/dA[i,:] depends on B[:,i]
(zero), so the new dimensions can never learn.

Seed new dimensions with small noise scaled relative to existing
weights: kaiming-style for lora_down, small random for lora_up. This
preserves the original ΔW (perturbation <1%) while ensuring gradients
flow through all new dimensions from step 1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>