
[Feature] Fast LoRA Switching for Real-time Interactive Applications #896

@sqhuang

Motivation

The current LoRA switching workflow has performance issues in real-time interactive scenarios:

  1. reset_lora() → reload original low-rank weights
  2. update_lora_params() → load LoRA file from disk, format conversion, concat, loadDict()

This process takes ~100ms+ per switch, which is too slow for interactive applications where users frequently switch between LoRA styles (e.g., anime/realistic/base).

Proposed Solution

Pre-build multiple unquant branch variants and load them all into GPU memory. Switching only requires calling loadDict() with the preloaded variant.

Performance comparison:

| Operation         | Current      | Optimized     |
|-------------------|--------------|---------------|
| Load file         | Every switch | Preload once  |
| Format conversion | Every switch | Precomputed   |
| Concat            | Every switch | Precomputed   |
| loadDict          | Every switch | Every switch  |

Result: Switch time reduced from ~100ms+ to <1ms (100x+ speedup)

Trade-off

Memory usage increases from 1 quant + 1 unquant to 1 quant + N unquant variants. Since unquant parts are low-rank, the extra memory cost is acceptable for most use cases.
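To make the memory trade-off concrete, here is a rough back-of-the-envelope estimate for one preloaded variant. The rank, hidden size, layer count, and dtype below are hypothetical placeholder numbers, not figures from this issue:

```python
# Rough memory estimate for one preloaded LoRA variant.
# Hypothetical assumptions: rank r = 64, square d x d adapted linear layers,
# 40 adapted layers, fp16 (2 bytes per parameter).
def lora_variant_bytes(rank, d_in, d_out, num_layers, bytes_per_param=2):
    # Each adapted layer stores two low-rank factors: A (d_in x r) and B (r x d_out).
    params_per_layer = rank * (d_in + d_out)
    return params_per_layer * num_layers * bytes_per_param

mib = lora_variant_bytes(rank=64, d_in=3072, d_out=3072, num_layers=40) / 2**20
print(f"~{mib:.0f} MiB per variant")  # ~30 MiB
```

At these (assumed) sizes each extra variant costs on the order of tens of MiB, which is small next to a multi-GB base model, consistent with the claim that keeping N unquant variants resident is acceptable.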

Proposed API

```python
# Preload multiple LoRA variants
transformer.preload_loras({
    "anime": "path/to/anime.safetensors",
    "realistic": "path/to/realistic.safetensors",
    "none": None,  # Base model without LoRA
})

# Fast switch (<1ms)
transformer.switch_lora("anime")
transformer.switch_lora("realistic")
transformer.switch_lora("none")

# Utility methods
transformer.list_preloaded_loras()  # ["anime", "realistic", "none"]
transformer.get_active_lora()       # "none"
transformer.unload_lora_variant("anime")  # Free memory
transformer.clear_preloaded_loras()       # Clear all
```
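A minimal sketch of how the preload cache behind this API could work. The `load_and_merge()`, `base_unquant_weights()`, and `loadDict()` hooks are hypothetical stand-ins for the existing pipeline steps (file load, format conversion, concat, weight load) described above, not real methods of this codebase:

```python
# Sketch: expensive work (load/convert/concat) happens once in preload_loras;
# switch_lora only calls loadDict with a precomputed, GPU-resident variant.
class LoraSwitcher:
    def __init__(self, transformer):
        self.transformer = transformer
        self._variants = {}   # name -> precomputed weight dict, kept on GPU
        self._active = None

    def preload_loras(self, mapping):
        # Pay the ~100ms cost once per variant, up front.
        for name, path in mapping.items():
            self._variants[name] = self._build_variant(path)

    def _build_variant(self, path):
        if path is None:
            # "none" variant: base model weights without any LoRA applied.
            return self.transformer.base_unquant_weights()
        # Disk load + format conversion + concat, done once.
        return self.transformer.load_and_merge(path)

    def switch_lora(self, name):
        # Fast path: only loadDict runs per switch.
        self.transformer.loadDict(self._variants[name])
        self._active = name

    def list_preloaded_loras(self):
        return list(self._variants)

    def get_active_lora(self):
        return self._active

    def unload_lora_variant(self, name):
        self._variants.pop(name, None)   # drop reference, freeing its memory

    def clear_preloaded_loras(self):
        self._variants.clear()
```

Keeping the cache keyed by name (rather than path) matches the proposed API and makes the `"none"` base-model entry a first-class variant, so switching back to no-LoRA takes the same fast path.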
