Checklist
Motivation
The current LoRA switching workflow has performance problems in real-time interactive scenarios:

1. `reset_lora()` → reload the original low-rank weights
2. `update_lora_params()` → load the LoRA file from disk, convert the format, concat, call `loadDict()`

This process takes ~100ms+ per switch, which is too slow for interactive applications where users frequently switch between LoRA styles (e.g., anime/realistic/base).
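The two steps above can be sketched as pseudocode to show where the per-switch cost goes. The helper names below (`load_safetensors_file`, `convert_and_concat`) are hypothetical stand-ins for the internals of `reset_lora()` / `update_lora_params()`, not the actual codebase:

```python
import time

def load_safetensors_file(path):
    # Stub: stands in for reading the LoRA file from disk (the slow part,
    # typically tens of milliseconds of I/O).
    time.sleep(0.01)
    return {"lora_A": [0.1] * 4, "lora_B": [0.2] * 4}

def convert_and_concat(weights):
    # Stub: stands in for format conversion + concat into loadDict() layout.
    return {"unquant_branch": weights["lora_A"] + weights["lora_B"]}

def switch_lora_slow(model_state, path):
    # Every single switch repeats disk I/O, conversion, and concat
    # before the weights finally reach loadDict().
    weights = load_safetensors_file(path)   # per-switch cost
    prepared = convert_and_concat(weights)  # per-switch cost
    model_state.update(prepared)            # stands in for loadDict()
    return model_state

state = switch_lora_slow({}, "path/to/anime.safetensors")
```

Only the last step (`loadDict()`) actually needs to happen at switch time; everything before it is recomputed work.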
Proposed Solution
Pre-build multiple unquant branch variants and load them all into GPU memory. Switching only requires calling loadDict() with the preloaded variant.
Performance comparison:
| Operation | Current | Optimized |
| --- | --- | --- |
| Load file | Every switch | Preload once |
| Format conversion | Every switch | Precomputed |
| Concat | Every switch | Precomputed |
| loadDict | Every switch | Every switch |
Result: Switch time reduced from ~100ms+ to <1ms (100x+ speedup)
Trade-off
Memory usage increases from 1 quant + 1 unquant to 1 quant + N unquant variants. Since unquant parts are low-rank, the extra memory cost is acceptable for most use cases.
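To make "acceptable" concrete, a back-of-the-envelope estimate (illustrative numbers, not measured from any particular model): a rank-r adapter over a d_in × d_out projection stores two low-rank factors, so it adds r·(d_in + d_out) parameters per adapted layer.

```python
def lora_variant_bytes(num_layers, d_in, d_out, rank, bytes_per_param=2):
    # Each adapted projection stores two low-rank factors:
    # A (d_in x rank) and B (rank x d_out), i.e. rank * (d_in + d_out) params.
    params_per_layer = rank * (d_in + d_out)
    return num_layers * params_per_layer * bytes_per_param

# Illustrative transformer: 40 adapted layers, 4096x4096 projections,
# rank 16, fp16 storage (2 bytes per parameter).
one_variant_mib = lora_variant_bytes(40, 4096, 4096, 16) / 2**20
# -> 10.0 MiB per extra preloaded variant
```

At these assumed sizes, each additional unquant variant costs on the order of 10 MiB, which is small next to a quantized multi-GB base model.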
Proposed API
```python
# Preload multiple LoRA variants
transformer.preload_loras({
    "anime": "path/to/anime.safetensors",
    "realistic": "path/to/realistic.safetensors",
    "none": None,  # Base model without LoRA
})

# Fast switch (<1ms)
transformer.switch_lora("anime")
transformer.switch_lora("realistic")
transformer.switch_lora("none")

# Utility methods
transformer.list_preloaded_loras()        # ["anime", "realistic", "none"]
transformer.get_active_lora()             # "none"
transformer.unload_lora_variant("anime")  # Free memory
transformer.clear_preloaded_loras()       # Clear all
```
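A minimal sketch of how this API could be backed by a variant cache. The class name, the `_build_variant` helper, and the injected `load_dict_fn` are assumptions for illustration, not the actual implementation:

```python
class LoraVariantCache:
    """Pre-builds unquant-branch variants so a switch is a dict lookup + loadDict()."""

    def __init__(self, load_dict_fn):
        self._load_dict = load_dict_fn  # stands in for the model's loadDict()
        self._variants = {}
        self._active = None

    def preload_loras(self, name_to_path):
        # Done once per variant: load file, convert format, concat (stubbed).
        for name, path in name_to_path.items():
            self._variants[name] = self._build_variant(path)

    def _build_variant(self, path):
        # Stub: in practice this would read the safetensors file and
        # precompute the loadDict()-ready tensors; None means "base model".
        return {"source": path}

    def switch_lora(self, name):
        # Fast path: no disk I/O or conversion, just hand the
        # precomputed dict to loadDict().
        self._load_dict(self._variants[name])
        self._active = name

    def list_preloaded_loras(self):
        return list(self._variants)

    def get_active_lora(self):
        return self._active

    def unload_lora_variant(self, name):
        self._variants.pop(name, None)

    def clear_preloaded_loras(self):
        self._variants.clear()

# Usage: record what loadDict() would receive instead of touching a model.
applied = []
cache = LoraVariantCache(load_dict_fn=applied.append)
cache.preload_loras({"anime": "path/to/anime.safetensors", "none": None})
cache.switch_lora("anime")
```

One design note: keeping `loadDict()` as the only per-switch operation matches the table above, where everything else moves to preload time.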