Train LoRA over GGUF #3894
Replies: 2 comments 1 reply
This is excellent research! Training LoRA over GGUF opens up fine-tuning on consumer hardware.

Why this matters:

Questions for quality validation:

Potential optimizations:
- Fused backward through quantized matmul?
- Custom CUDA kernel for dequant -> matmul -> grad?

Use case we are excited about: We push the limits of consumer hardware training at Revolution AI — this could be a game-changer for the community.
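The "dequant -> matmul -> grad" idea above can be sketched in plain NumPy. This is a toy block-wise quantization scheme (not the actual GGUF K-quant layout; `dequant`, `block`, and the scale values are illustrative), just to show what a fused kernel would need to compute in forward and backward:

```python
import numpy as np

def dequant(q, scales, block=32):
    # Block-wise dequantize integer-style codes with one scale per
    # block (a mock scheme, not the real GGUF K-quant layout).
    return (q.reshape(-1, block) * scales[:, None]).reshape(q.shape)

def forward(x, q, scales):
    # y = x @ W^T with the weight dequantized on the fly.
    W = dequant(q, scales)
    return x @ W.T

def backward(x, q, scales, grad_y):
    # With a frozen base weight (LoRA trains separate adapters),
    # only the input gradient is needed: grad_x = grad_y @ W.
    # A fused kernel could do this without keeping W materialized
    # in high precision.
    W = dequant(q, scales)
    return grad_y @ W

# Tiny example: a 4x8 weight quantized to signed codes, one scale per
# 32-element block.
rng = np.random.default_rng(0)
W_true = rng.standard_normal((4, 8)).astype(np.float32)
q = np.clip(np.round(W_true / 0.1), -8, 7)   # fake 4-bit codes
scales = np.full(q.size // 32, 0.1, dtype=np.float32)
```

A real kernel would fuse the dequantize and matmul steps per tile so the full-precision weight never hits global memory; the sketch only shows the math being fused.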
This is going to be a huge boon for the community, especially for folks who don't have the heavy-duty hardware needed to load a full 16-bit model into memory just to compress it down to 4-bit. The Unsloth team is great at getting models out the door, but waiting for the pre-quantized bnb-4bit models to be officially uploaded can sometimes take a while compared to GGUF availability. Being able to just grab a GGUF and start training LoRA directly will make fine-tuning so much more accessible for everyone. Thanks for working on this; I'll be sure to look into your project in more depth.
Hi, I've made a proof of concept showing that we can train LoRA over a GGUF base model rather than a bnb 4-bit quantized one. With a 3-bit rather than 4-bit base model, we can train Qwen-30B-A3B in 16 GB of VRAM rather than 24 GB.
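The VRAM figures quoted here roughly track the footprint of the base weights alone. A back-of-envelope estimate, assuming ~30B stored parameters and idealized bits-per-weight (ignoring K-quant scale overhead, activations, KV cache, and optimizer state, so the real totals land higher):

```python
def weight_gib(n_params, bits_per_weight):
    # GiB needed for the quantized base weights alone.
    return n_params * bits_per_weight / 8 / 2**30

n = 30e9  # approximate total parameter count of the 30B model
print(f"4-bit: {weight_gib(n, 4):.1f} GiB")  # 14.0 GiB
print(f"3-bit: {weight_gib(n, 3):.1f} GiB")  # 10.5 GiB
```

The roughly 3.5 GiB saved on weights, plus the correspondingly smaller activations of dequantized tiles, is what lets the 3-bit run fit on a 16 GB card.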
For convenience I'm developing it in my repo https://github.com/woct0rdho/transformers-qwen3-moe-fused#lora-over-gguf , but it also works with many models that are not Qwen and not MoE.
For now it still has a lot of rough edges, and we need more experiments to check the quality of such LoRAs and to optimize the training speed.
I'm also planning to upstream it to transformers; see huggingface/transformers#40070.
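The core mechanic being described, a frozen quantized base plus trainable low-rank adapters, can be sketched as follows. This is a minimal toy (a single per-tensor scale instead of the real GGUF format; `lora_forward`, `alpha`, and the shapes are illustrative), not the actual implementation in the linked repo:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, r = 8, 6, 2

# Frozen base weight, stored as integer codes plus a scale (toy scheme;
# real GGUF uses block-wise K-quants).
scale = 0.05
q = rng.integers(-8, 8, size=(d_out, d_in)).astype(np.float32)

# Trainable LoRA adapters. B starts at zero, so at initialization the
# adapted layer reproduces the base model exactly.
A = (rng.standard_normal((r, d_in)) * 0.01).astype(np.float32)
B = np.zeros((d_out, r), dtype=np.float32)
alpha = 2.0

def lora_forward(x):
    W = q * scale  # dequantize the frozen base on the fly
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)
```

Only `A` and `B` receive gradients; the quantized codes `q` stay untouched, which is why the base model can live in 3- or 4-bit precision for the whole run.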