Qwen 2.5 32B training (Lora) and hardware constraints #2705
Replies: 1 comment 5 replies
-
Hey, thanks for the question; let me try to answer. Your overall config looks fine, and I don't think gradient checkpointing offload is the cause.
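If you want to rule it out anyway, one framework-agnostic sanity check is to reproduce activation offloading in isolation with PyTorch's saved-tensor hook. This is only an approximation of what a training framework's checkpoint-offload option does; the toy model and sizes below are arbitrary:

```python
# Toy forward/backward with activations offloaded to CPU via PyTorch's saved-tensor hook.
# This only approximates a framework's "gradient checkpoint offload" option, but it helps
# confirm that CPU offloading by itself is numerically benign on this hardware.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()
x = torch.randn(8, 4096, device="cuda", requires_grad=True)

# Saved activations are moved to pinned CPU memory during the forward pass and
# copied back to the GPU when backward needs them.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = model(x).pow(2).mean()
loss.backward()

print("finite grads:", torch.isfinite(x.grad).all().item())
```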
-
Good morning,
There are a few questions here that I hope others can help with. To explain what I'm doing:
I'm using a Qwen 2.5 32B model and training a LoRA on it, which I then want to merge with the base model and convert and quantize to GGUF. The full process works "sometimes", and I've had good results, but the inconsistency of it all bothers me a bit.
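For reference, the merge step is along these lines; this is a minimal sketch assuming PEFT and Transformers, and the model name and adapter/output paths are placeholders rather than my exact script:

```python
# Merge a LoRA adapter back into the base model and save a plain HF checkpoint
# (placeholder paths/names, not the exact setup from this post).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen2.5-32B-Instruct"  # placeholder: base model the LoRA was trained on
adapter_dir = "./lora-out"             # placeholder: directory with the trained adapter
merged_dir = "./qwen2.5-32b-merged"    # placeholder: output directory for the merged model

# Load the base model in bf16 on CPU so the merge doesn't depend on GPU memory.
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# Attach the adapter, fold its weights into the base layers, and drop the LoRA modules.
model = PeftModel.from_pretrained(base, adapter_dir)
model = model.merge_and_unload()

# Save weights plus tokenizer so convert_hf_to_gguf.py sees a complete HF checkpoint.
model.save_pretrained(merged_dir, safe_serialization=True)
AutoTokenizer.from_pretrained(base_id).save_pretrained(merged_dir)
```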
The yaml file I'm using for this is below:
The lines I think are causing part of my issue are the following:
In the above, to my understanding, it only loads into memory the weights that are going to be fine-tuned, leaving the rest on disk. I've noticed three things that come up with this setup:
This happens in maybe 30% of the runs. I can force-quit and rerun, and sometimes it clears up and converges.
When running the following:
I sometimes get errors at the following stages (a rough sketch of both steps is just after this list):
- convert_hf_to_gguf.py
- llama-quantize
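For reference, the conversion and quantization stages look roughly like this; the paths, output types, and location of the llama.cpp checkout are placeholders, not my exact invocation:

```python
# Rough sketch of the GGUF conversion and quantization stages (placeholder paths and types).
import subprocess

merged_dir = "./qwen2.5-32b-merged"       # merged HF checkpoint from the previous step
f16_gguf = "./qwen2.5-32b-f16.gguf"       # intermediate full-precision GGUF
quant_gguf = "./qwen2.5-32b-q4_k_m.gguf"  # final quantized GGUF

# Stage 1: HF safetensors -> GGUF, using the converter script from the llama.cpp repo.
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", merged_dir,
     "--outfile", f16_gguf, "--outtype", "f16"],
    check=True,  # fail loudly instead of silently continuing to the next stage
)

# Stage 2: GGUF -> quantized GGUF with the llama-quantize binary.
subprocess.run(
    ["llama.cpp/build/bin/llama-quantize", f16_gguf, quant_gguf, "Q4_K_M"],
    check=True,
)
```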
The inconsistent nature of all this bugs me, and I haven't really found a root cause. I've tried wiping the merge directory, restarting from that step, and running everything again, and I'm currently retrying everything from scratch. Part of me thinks I may have a bad drive in my pool that's corrupting something; I already bought a new drive that's coming in today, but maybe it's something else I'm overlooking.
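One way to rule silent corruption in or out would be to checksum the artifacts after each stage and compare the hashes across reruns; the paths below are just examples:

```python
# Checksum each produced artifact so a flaky drive shows up as a hash mismatch between reruns.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so multi-GB checkpoints don't need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Example locations -- adjust to wherever the merged checkpoint and GGUF files actually live.
artifacts = sorted(Path("./qwen2.5-32b-merged").glob("*.safetensors")) + [
    Path("./qwen2.5-32b-f16.gguf"),
    Path("./qwen2.5-32b-q4_k_m.gguf"),
]
for artifact in artifacts:
    if artifact.exists():
        print(f"{sha256_of(artifact)}  {artifact}")
```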
Also, if anyone can think of better ways I could approach this, I'd appreciate it. Some information about my hardware is:
I did try using Accelerate with multiple cards, but then the "all tensors on one device" error comes up. If anyone can help, I'd appreciate it.
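For context, the multi-card attempt was along these lines; this is a rough sketch rather than my exact code, and the comment about merging on CPU is only my guess at a workaround:

```python
# Sketch of loading the base model sharded across multiple GPUs with Accelerate's device_map.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_id = "Qwen/Qwen2.5-32B-Instruct"  # placeholder base model
adapter_dir = "./lora-out"             # placeholder adapter directory

# device_map="auto" lets Accelerate split the 32B weights across all visible GPUs
# (spilling to CPU if they don't fit).
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_dir)

# My assumption: merging while the weights are sharded across devices is what triggers the
# "all tensors on one device" error, so loading with device_map={"": "cpu"} and doing the
# merge entirely on CPU (slower, but single-device) may avoid it.
merged = model.merge_and_unload()
```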