Description
Hello,
I am trying to fine-tune the openbmb/MiniCPM-V-2_6 model on a custom handwriting dataset (GNHK) using a single NVIDIA RTX 3060 with 12GB of VRAM.
I am running into a TypeError that seems to be caused by a conflict between QLoRA with a CPU device map and the accelerate library: the fine-tuning script crashes in accelerate's prepare step because the model is not on a GPU device.
Here is the final error log, my configuration, and my environment details.
Final error log:
```
/home/engineeringpc/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py:28: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
from pkg_resources import packaging # type: ignore[attr-defined]
2025-11-13 12:52:36.762313: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2025-11-13 12:52:36.789539: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-11-13 12:52:37.226369: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
torch_dtype is deprecated! Use dtype instead!
Loading checkpoint shards: 100%|████████████████████| 2/2 [00:00<00:00, 2.49it/s]
Currently using LoRA for fine-tuning the MiniCPM-V model.
{'Total': 4676436720, 'Trainable': 682268912}
llm_type=qwen2
Loading data...
/home/engineeringpc/Desktop/OCR_minicpm_v1.2_finetune/MiniCPM-V/finetune/finetune.py:279: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for CPMTrainer.__init__. Use processing_class instead.
trainer = CPMTrainer(
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': 151644, 'pad_token_id': 151643}.
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/engineeringpc/Desktop/OCR_minicpm_v1.2_finetune/MiniCPM-V/finetune/finetune.py", line 296, in
[rank0]: train()
[rank0]: File "/home/engineeringpc/Desktop/OCR_minicpm_v1.2_finetune/MiniCPM-V/finetune/finetune.py", line 286, in train
[rank0]: trainer.train()
[rank0]: File "/home/engineeringpc/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2325, in train
[rank0]: return inner_training_loop(
[rank0]: File "/home/engineeringpc/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2480, in _inner_training_loop
[rank0]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank0]: File "/home/engineeringpc/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1559, in prepare
[rank0]: result = tuple(
[rank0]: File "/home/engineeringpc/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1560, in
[rank0]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]: File "/home/engineeringpc/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1402, in _prepare_one
[rank0]: return self.prepare_model(obj, device_placement=device_placement)
[rank0]: File "/home/engineeringpc/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1789, in prepare_model
[rank0]: elif torch.device(current_device_index) != self.device:
[rank0]: TypeError: device() received an invalid combination of arguments - got (NoneType), but expected one of:
[rank0]: * (torch.device device)
[rank0]: didn't match because some of the arguments have invalid types: (NoneType)
[rank0]: * (str type, int index = -1)
E1113 12:52:41.988000 326883 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 326942) of binary: /usr/bin/python3
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
finetune/finetune.py FAILED
```
My start_qlora.sh script:
```bash
#!/bin/bash
torchrun --nproc_per_node=1 --master_port=6001 finetune/finetune.py \
    --model_name_or_path model \
    --llm_type qwen2 \
    --data_path train.json \
    --eval_data_path test.json \
    --fp16 true \
    --do_train \
    --do_eval \
    --tune_vision true \
    --tune_llm false \
    --use_lora true \
    --q_lora true \
    --model_max_length 2048 \
    --max_steps 10000 \
    --eval_steps 1000 \
    --output_dir output/gnhk_qlora \
    --logging_dir output/gnhk_qlora/logs \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 3 \
    --learning_rate 1e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --gradient_checkpointing true \
    --report_to "tensorboard"
```
My finetune.py edits:
To work around an earlier OutOfMemoryError, I was advised to load the model on the CPU first, so I edited the AutoModel.from_pretrained call in finetune/finetune.py as follows:
```python
model = AutoModel.from_pretrained(
    model_args.model_name_or_path,
    trust_remote_code=True,
    torch_dtype=compute_dtype,
    device_map={"":"cpu"},
)
```
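For reference, my current understanding (which may well be wrong, and is exactly what I would like confirmed) is that for QLoRA the model should instead be quantized to 4-bit and placed directly on the GPU, roughly like the sketch below. The BitsAndBytesConfig values and the device_map here are my own guesses, not something taken from finetune.py:

```python
from transformers import AutoModel, BitsAndBytesConfig
import torch

# Hypothetical alternative (untested): quantize to 4-bit so the model fits in
# 12 GB of VRAM and keep it on GPU 0 instead of the CPU. How this should
# integrate with finetune.py's LoRA setup is what I am asking about.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModel.from_pretrained(
    model_args.model_name_or_path,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map={"": 0},  # place the whole model on the single RTX 3060
)
```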
My Environment (nvidia-smi):
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:01:00.0 Off | N/A |
| 0% 41C P8 18W / 170W | 324MiB / 12288MiB | 0% Default |
+-----------------------------------------+------------------------+----------------------+
```

It seems accelerate cannot handle the model being entirely on the CPU during the prepare step: from the traceback, the failing check in prepare_model is torch.device(current_device_index) != self.device, and current_device_index is None when the whole model sits on the CPU.
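Just to confirm where the TypeError itself originates, the same error can be reproduced in isolation; this snippet is only an illustration of the failing call, not part of my setup:

```python
import torch

# accelerate's prepare_model calls torch.device(current_device_index);
# when current_device_index is None, torch raises the exact TypeError above.
torch.device(None)
```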
Could you please provide the correct configuration or code edits to successfully fine-tune with QLoRA on a single 12GB GPU?
Thank you so much for your help.