
[Feature] Fast LoRA Switching for Real-time Interactive Applications #896

@sqhuang

Motivation

The current LoRA switching workflow has performance issues in real-time interactive scenarios:

  1. reset_lora() → reload original low-rank weights
  2. update_lora_params() → load LoRA file from disk, format conversion, concat, loadDict()

This process takes ~100ms+ per switch, which is too slow for interactive applications where users frequently switch between LoRA styles (e.g., anime/realistic/base).

Proposed Solution

Pre-build multiple unquant branch variants and load them all into GPU memory. Switching only requires calling loadDict() with the preloaded variant.

Performance comparison:

| Operation         | Current      | Optimized     |
|-------------------|--------------|---------------|
| Load file         | Every switch | Preload once  |
| Format conversion | Every switch | Precomputed   |
| Concat            | Every switch | Precomputed   |
| loadDict          | Every switch | Every switch  |

Result: Switch time reduced from ~100ms+ to <1ms (100x+ speedup)

Trade-off

Memory usage increases from 1 quant + 1 unquant to 1 quant + N unquant variants. Since unquant parts are low-rank, the extra memory cost is acceptable for most use cases.
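To make the memory trade-off concrete, here is a rough back-of-the-envelope estimate for one preloaded variant. The rank, hidden size, layer count, and dtype below are hypothetical placeholder numbers, not figures from this issue:

```python
# Rough memory estimate for one preloaded LoRA variant.
# Hypothetical assumptions: rank r = 64, square d x d adapted linear layers,
# 40 adapted layers, fp16 (2 bytes per parameter).
def lora_variant_bytes(rank, d_in, d_out, num_layers, bytes_per_param=2):
    # Each adapted layer stores two low-rank factors: A (d_in x r) and B (r x d_out).
    params_per_layer = rank * (d_in + d_out)
    return params_per_layer * num_layers * bytes_per_param

mib = lora_variant_bytes(rank=64, d_in=3072, d_out=3072, num_layers=40) / 2**20
print(f"~{mib:.0f} MiB per variant")  # ~30 MiB
```

At these (assumed) sizes each extra variant costs on the order of tens of MiB, which is small next to a multi-GB base model, consistent with the claim that keeping N unquant variants resident is acceptable.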

Proposed API

```python
# Preload multiple LoRA variants
transformer.preload_loras({
    "anime": "path/to/anime.safetensors",
    "realistic": "path/to/realistic.safetensors",
    "none": None,  # Base model without LoRA
})

# Fast switch (<1ms)
transformer.switch_lora("anime")
transformer.switch_lora("realistic")
transformer.switch_lora("none")

# Utility methods
transformer.list_preloaded_loras()  # ["anime", "realistic", "none"]
transformer.get_active_lora()       # "none"
transformer.unload_lora_variant("anime")  # Free memory
transformer.clear_preloaded_loras()       # Clear all
```
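A minimal sketch of how the preload cache behind this API could work. The `load_and_merge()`, `base_unquant_weights()`, and `loadDict()` hooks are hypothetical stand-ins for the existing pipeline steps (file load, format conversion, concat, weight load) described above, not real methods of this codebase:

```python
# Sketch: expensive work (load/convert/concat) happens once in preload_loras;
# switch_lora only calls loadDict with a precomputed, GPU-resident variant.
class LoraSwitcher:
    def __init__(self, transformer):
        self.transformer = transformer
        self._variants = {}   # name -> precomputed weight dict, kept on GPU
        self._active = None

    def preload_loras(self, mapping):
        # Pay the ~100ms cost once per variant, up front.
        for name, path in mapping.items():
            self._variants[name] = self._build_variant(path)

    def _build_variant(self, path):
        if path is None:
            # "none" variant: base model weights without any LoRA applied.
            return self.transformer.base_unquant_weights()
        # Disk load + format conversion + concat, done once.
        return self.transformer.load_and_merge(path)

    def switch_lora(self, name):
        # Fast path: only loadDict runs per switch.
        self.transformer.loadDict(self._variants[name])
        self._active = name

    def list_preloaded_loras(self):
        return list(self._variants)

    def get_active_lora(self):
        return self._active

    def unload_lora_variant(self, name):
        self._variants.pop(name, None)   # drop reference, freeing its memory

    def clear_preloaded_loras(self):
        self._variants.clear()
```

Keeping the cache keyed by name (rather than path) matches the proposed API and makes the `"none"` base-model entry a first-class variant, so switching back to no-LoRA takes the same fast path.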
