Qwen 2.5 32B training (Lora) and hardware constraints #2705

TheDarkTrumpet · 2025-05-22T12:20:39Z

TheDarkTrumpet
May 22, 2025

Good morning,

There a few questions that I hope others can help with. To explain what I'm doing.

I'm using a Qwen 2.5 model, 32B and am training a LoRA off it, which then I want to merge with the base model, then convert and quantize to GGUF. The full process works "sometimes", and I've had good results, but the inconsistent nature of this all bothers me a bit.

The yaml file I'm using for this is below:

# Originally taken from: https://github.com/axolotl-ai-cloud/axolotl/blob/a27b909c5c1c2c561a8d503024b89afcce15226f/examples/qwen3/32b-qlora.yaml
base_model: Qwen/Qwen2.5-32B

plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
strict: false

chat_template: qwen_25
datasets:
  - path: ./no_git/17_suggested_changes_with_new.csv
    type: alpaca
val_set_size: 0
eval_sample_packing: true  

output_dir: ./no_git/17_qwenmspe_changes/
dataset_prepared_path: last_run_prepared

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

load_in_4bit: true
adapter: qlora
lora_r: 32 # up from 16
lora_alpha: 64 # Up from 32
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - down_proj
  - up_proj
lora_mlp_kernel: true
lora_qkv_kernel: true
lora_o_kernel: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 1
num_epochs: 10
optimizer: adamw_torch_4bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: auto
tf32: true

gradient_checkpointing: offload
gradient_checkpointing_kwargs:
  use_reentrant: false
resume_from_checkpoint:
logging_steps: 1
flash_attention: true

warmup_steps: 50 # From 10
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.01 # up from 0
special_tokens:
use_tensorboard: true

The lines I'm thinking is causing part of my issue is the following:

gradient_checkpointing: offload
gradient_checkpointing_kwargs:
  use_reentrant: false

In the above, to my understanding, it's only loading the weights into memory that are going to be fine tuned, leaving the rest on disk. I've noticed three things that come up with this setup:

Warnings in the training operation (maybe not related to the above):

[2025-05-22 07:04:18,362] [WARNING] [axolotl.monkeypatch.lora_kernels.warning_once:82] [PID:2514185] [RANK:0] Cannot patch some attention QKV projections - requires LoRA adapters with no bias
[2025-05-22 07:04:18,362] [WARNING] [axolotl.monkeypatch.lora_kernels.warning_once:82] [PID:2514185] [RANK:0] Cannot patch some MLP layers - requires LoRA adapters with no bias

The "loss" will stay consistent, and "grad_norm" will remain at 0 throughout training.

This happens in maybe 30% of the runs. I can force quit it and rerun it, and sometimes this clears, and it converges.

Merge and Post process issues.

When running the following:

python3 -m axolotl.cli.merge_lora 17-train-qwen32b-lora.yaml --lora_model_dir="./no_git/17_qwenmspe_changes"

python3 convert_hf_to_gguf.py ~/path/to/17_qwenmspe_changes/merged/

llama-quantize ~/path/to/17_qwenmspe_changes/merged/Merged-33B-F16.gguf ~/path/to/17_qwenmspe_changes/msom-qwen32b-mspe-suggestions-Q5.gguf Q5_0

That I sometimes will get errors in the following stages:

convert_hf_to_gguf.py:

/path/to/llama.cpp/gguf-py/gguf/lazy.py:217: RuntimeWarning: overflow encountered in cast
  return type(self)(meta=meta, args=full_args, kwargs=kwargs, func=(lambda a, *args, **kwargs: a.astype(*args, **kwargs)))
Writing: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 65.5G/65.5G [06:56<00:00, 157Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to .....

llama-quantize:

[ 432/ 771]               blk.35.ffn_down.weight - [27648,  5120,     1,     1], type =    f16, converting to q5_0 .. size =   270.00 MiB ->    92.81 MiB
ggml_validate_row_data: found inf value at block 85807928
llama_model_quantize: failed to quantize: tensor 'blk.35.ffn_gate.weight' has invalid data

The inconsistent nature of all this bugs me. I haven't found really a root cause that's causing it. I've tried wiping the merge directory and starting from that step and running everything. I'm currently retrying everything from scratch. Part of me thinks that I may have a bad drive in my pool that's corrupting something. I already bought a new drive that's coming in today, but maybe it's something else I'm overlooking.

Also, if anyone can think of better ways I could approach this, I'd appreciate it. Some information about my hardware is:

AMD Threadripper 24/48
128Gb of onboard system Ram.
3x Nvidia ADA A6000 cards (48Gb of RAM a piece)

I did try ideas of using accelerate and multiple cards, but all tensors on one device comes up. But if anyone can help, I'd appreciate it.

NanoCode012 · 2025-05-22T12:31:11Z

NanoCode012
May 22, 2025
Maintainer

Hey, thanks for the question. Let me try to answer your question.

Your overall config looks fine. I don't think grad checkpoint offload is the cause.

I'll ask the team about lora kernel warnings.
Where does it stall? Is there some pattern? Do you have some metric charts?
For post quantize, perhaps you may find some help in their respective forums? I'm not really clear on that error/warning.

5 replies

TheDarkTrumpet May 22, 2025
Author

Hi NanoCode,

Thanks for the quick reply and being willing to assist. To answer the questions:

Thanks for following up with the team about those warnings. I took the example from a Qwen 3 lora, with only minor changes, so I'm glad to hear I'm likely doing things right at least.
For stalling, it happens pretty immediately, and stays consistent. Unfortunately, the tensor log files aren't available for this (deleted the directory when restarting from scratch). Generally speaking, when I see this "stalling", I let it run for about an epoch to see if it gets itself out of that rut, but it hasn't. Going by memory, the loss would stick at a consistent 12.6 (multiple runs, same number). The grad_normal would be at 0. The learning_rate would stay consistent (but I don't recall that number either).
Thanks for the suggestion to try going to llama.cpp's discussion area to ask. I'll likely do that (bit more on that below).

Part of me is hedging my bets that this may be hardware. I have multiple NVMEs that are quite old all in a pool. I've had gguf files become "corrupt" before. I haven't seen this "stalling" since I wiped my huggingface model directory cache for this model, and had it repull. The invalid data, though (part of the merge process) could also be hardware but maybe not either.

I should have the new drive roughly mid afternoon. Once I do, I'll plan on moving this entire project to that drive, and symlinking the huggingface cache directory also to this drive. Then, wipe and reload. I'll run the test a few times to see if I can duplicate all this. Also, if I encounter this stall issue again, I'll make sure to grab that tensorboard directory and any log files I can and post them here.

Again, thank you very much for the quick reply and being willing to help. I'll make sure to respond here tomorrow morning since I should be able to get through at least 2 runs by then and can see if this continues.

TheDarkTrumpet May 22, 2025
Author

Sorry for the double reply, but one question that may be worth asking if possible. So, the offloading I'm doing right now. If I did want to load this across my cards and utilize all 3 (which, may barely fit this model), is that possible? There's a few reasons for this:

I'm thinking if I wanted to do a full fine tuning, if I have the hardware to do that operation.
To see if this lora warning would stick around if I had it all loaded into memory. Just to see if this warning is tied to the offloading or the patching that's going on.

TheDarkTrumpet May 23, 2025
Author

I did some more testing. Took awhile. I ran the test twice with the new drive and it worked both times. Pretty confident it's a drive issue in the internal machine, and something i'll need to deal with.

Thanks again for your help before.

NanoCode012 May 26, 2025
Maintainer

Hey, thanks for the followup over the weekend.

One additional suggestion I have is to run with a metric tracker (we support tensorboard, wandb, etc) to help track these values in case we forget those numbers). For some of the above, it also tracks the memory usage and info about the system.

TheDarkTrumpet May 28, 2025
Author

I appreciate the feedback on that. I do use Tensorboard there. The one change I need to look at and hope to make is to log tensorboard to a dedicated location outside of the output directory. Been looking at collating these logs for awhile now, so I can keep track of things over a longer period without managing log files so much.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Qwen 2.5 32B training (Lora) and hardware constraints #2705

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 5 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Uh oh!

Qwen 2.5 32B training (Lora) and hardware constraints #2705

Uh oh!

TheDarkTrumpet May 22, 2025

Replies: 1 comment · 5 replies

Uh oh!

Uh oh!

NanoCode012 May 22, 2025 Maintainer

Uh oh!

TheDarkTrumpet May 22, 2025 Author

Uh oh!

TheDarkTrumpet May 22, 2025 Author

Uh oh!

TheDarkTrumpet May 23, 2025 Author

Uh oh!

NanoCode012 May 26, 2025 Maintainer

Uh oh!

TheDarkTrumpet May 28, 2025 Author

TheDarkTrumpet
May 22, 2025

Replies: 1 comment 5 replies

NanoCode012
May 22, 2025
Maintainer

TheDarkTrumpet May 22, 2025
Author

TheDarkTrumpet May 22, 2025
Author

TheDarkTrumpet May 23, 2025
Author

NanoCode012 May 26, 2025
Maintainer

TheDarkTrumpet May 28, 2025
Author