Train LoRA over GGUF #3894
Replies: 2 comments 1 reply
This is excellent research! Training LoRA over GGUF opens up fine-tuning on consumer hardware.

Why this matters:

Questions for quality validation:

Potential optimizations:
- Fused backward through quantized matmul?
- Custom CUDA kernel for dequant -> matmul -> grad?

Use case we are excited about: We push the limits of consumer hardware training at Revolution AI — this could be a game-changer for the community.
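The "dequant -> matmul -> grad" idea above can be sketched in plain NumPy. This is a toy block-wise quantization scheme (not the actual GGUF K-quant layout; `dequant`, `block`, and the scale values are illustrative), just to show what a fused kernel would need to compute in forward and backward:

```python
import numpy as np

def dequant(q, scales, block=32):
    # Block-wise dequantize integer-style codes with one scale per
    # block (a mock scheme, not the real GGUF K-quant layout).
    return (q.reshape(-1, block) * scales[:, None]).reshape(q.shape)

def forward(x, q, scales):
    # y = x @ W^T with the weight dequantized on the fly.
    W = dequant(q, scales)
    return x @ W.T

def backward(x, q, scales, grad_y):
    # With a frozen base weight (LoRA trains separate adapters),
    # only the input gradient is needed: grad_x = grad_y @ W.
    # A fused kernel could do this without keeping W materialized
    # in high precision.
    W = dequant(q, scales)
    return grad_y @ W

# Tiny example: a 4x8 weight quantized to signed codes, one scale per
# 32-element block.
rng = np.random.default_rng(0)
W_true = rng.standard_normal((4, 8)).astype(np.float32)
q = np.clip(np.round(W_true / 0.1), -8, 7)   # fake 4-bit codes
scales = np.full(q.size // 32, 0.1, dtype=np.float32)
```

A real kernel would fuse the dequantize and matmul steps per tile so the full-precision weight never hits global memory; the sketch only shows the math being fused.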
This is going to be a huge boon for the community, especially for folks who don't have the heavy-duty hardware needed to load a full 16-bit model into memory just to compress it down to 4-bit. The Unsloth team is great at getting models out the door, but waiting for the pre-quantized bnb-4bit models to be officially uploaded can sometimes take a while compared to GGUF availability. Being able to just grab a GGUF and start training LoRA directly will make fine-tuning so much more accessible for everyone. Thanks for working on this; I'll be sure to look into your project in more depth.
Hi, I've made a proof of concept showing that we can train LoRA over a GGUF base model rather than a bnb 4-bit quantized one. With a 3-bit rather than 4-bit base model, we can train Qwen-30B-A3B in 16 GB of VRAM rather than 24 GB.
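The VRAM figures quoted here roughly track the footprint of the base weights alone. A back-of-envelope estimate, assuming ~30B stored parameters and idealized bits-per-weight (ignoring K-quant scale overhead, activations, KV cache, and optimizer state, so the real totals land higher):

```python
def weight_gib(n_params, bits_per_weight):
    # GiB needed for the quantized base weights alone.
    return n_params * bits_per_weight / 8 / 2**30

n = 30e9  # approximate total parameter count of the 30B model
print(f"4-bit: {weight_gib(n, 4):.1f} GiB")  # 14.0 GiB
print(f"3-bit: {weight_gib(n, 3):.1f} GiB")  # 10.5 GiB
```

The roughly 3.5 GiB saved on weights, plus the correspondingly smaller activations of dequantized tiles, is what lets the 3-bit run fit on a 16 GB card.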
For convenience I'm developing it in my repo https://github.com/woct0rdho/transformers-qwen3-moe-fused#lora-over-gguf , but it also works with many models that are not Qwen and not MoE.
For now it still has a lot of rough edges, and we need more experiments to check the quality of such LoRAs and to optimize the training speed.
I'm also planning to upstream it to transformers; see huggingface/transformers#40070.
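The core mechanic being described, a frozen quantized base plus trainable low-rank adapters, can be sketched as follows. This is a minimal toy (a single per-tensor scale instead of the real GGUF format; `lora_forward`, `alpha`, and the shapes are illustrative), not the actual implementation in the linked repo:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, r = 8, 6, 2

# Frozen base weight, stored as integer codes plus a scale (toy scheme;
# real GGUF uses block-wise K-quants).
scale = 0.05
q = rng.integers(-8, 8, size=(d_out, d_in)).astype(np.float32)

# Trainable LoRA adapters. B starts at zero, so at initialization the
# adapted layer reproduces the base model exactly.
A = (rng.standard_normal((r, d_in)) * 0.01).astype(np.float32)
B = np.zeros((d_out, r), dtype=np.float32)
alpha = 2.0

def lora_forward(x):
    W = q * scale  # dequantize the frozen base on the fly
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)
```

Only `A` and `B` receive gradients; the quantized codes `q` stay untouched, which is why the base model can live in 3- or 4-bit precision for the whole run.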