
"Fine-tuning MiniCPM-V-2.6 with QLoRA on a single 12GB GPU fails with TypeError: device() received an invalid combination of arguments - got (NoneType) #1040

@Laxmikant-ArnishAi

Description

Hello,
I am trying to fine-tune the openbmb/MiniCPM-V-2_6 model on a custom handwriting dataset (GNHK) using a single NVIDIA RTX 3060 with 12GB of VRAM.
I am running into a TypeError that seems to be caused by a conflict between using QLoRA with a CPU device map and the accelerate library: training crashes inside accelerator.prepare() because the model is not on a GPU device.
Here is the final error log, my configuration, and my environment details.
Final Error Log
```
/home/engineeringpc/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py:28: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
from pkg_resources import packaging # type: ignore[attr-defined]
2025-11-13 12:52:36.762313: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2025-11-13 12:52:36.789539: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-11-13 12:52:37.226369: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
torch_dtype is deprecated! Use dtype instead!
Loading checkpoint shards: 100%|████████████████████| 2/2 [00:00<00:00, 2.49it/s]
Currently using LoRA for fine-tuning the MiniCPM-V model.
{'Total': 4676436720, 'Trainable': 682268912}
llm_type=qwen2
Loading data...
/home/engineeringpc/Desktop/OCR_minicpm_v1.2_finetune/MiniCPM-V/finetune/finetune.py:279: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for CPMTrainer.__init__. Use processing_class instead.
trainer = CPMTrainer(
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': 151644, 'pad_token_id': 151643}.
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/engineeringpc/Desktop/OCR_minicpm_v1.2_finetune/MiniCPM-V/finetune/finetune.py", line 296, in
[rank0]: train()
[rank0]: File "/home/engineeringpc/Desktop/OCR_minicpm_v1.2_finetune/MiniCPM-V/finetune/finetune.py", line 286, in train
[rank0]: trainer.train()
[rank0]: File "/home/engineeringpc/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2325, in train
[rank0]: return inner_training_loop(
[rank0]: File "/home/engineeringpc/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2480, in _inner_training_loop
[rank0]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank0]: File "/home/engineeringpc/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1559, in prepare
[rank0]: result = tuple(
[rank0]: File "/home/engineeringpc/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1560, in
[rank0]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]: File "/home/engineeringpc/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1402, in _prepare_one
[rank0]: return self.prepare_model(obj, device_placement=device_placement)
[rank0]: File "/home/engineeringpc/.local/lib/python3.10/site-packages/accelerate/accelerator.py", line 1789, in prepare_model
[rank0]: elif torch.device(current_device_index) != self.device:
[rank0]: TypeError: device() received an invalid combination of arguments - got (NoneType), but expected one of:
[rank0]: * (torch.device device)
[rank0]: didn't match because some of the arguments have invalid types: (NoneType)
[rank0]: * (str type, int index = -1)

E1113 12:52:41.988000 326883 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 326942) of binary: /usr/bin/python3
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetune/finetune.py FAILED
```
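For what it's worth, the crash appears to be torch.device() being called with a None index: a CPU placement carries no device index, so the check in accelerate's prepare_model fails. A minimal reproduction of just that failure:

```python
import torch

# device_map={"": "cpu"} places the model on the CPU, and a CPU device
# has no index, so accelerate ends up calling torch.device(None):
current_device = torch.device("cpu")
print(current_device.index)         # None -- CPU devices carry no index
torch.device(current_device.index)  # raises the TypeError from the log
```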
My start_qlora.sh script:
```bash
#!/bin/bash
torchrun --nproc_per_node=1 --master_port=6001 finetune/finetune.py \
    --model_name_or_path model \
    --llm_type qwen2 \
    --data_path train.json \
    --eval_data_path test.json \
    --fp16 true \
    --do_train \
    --do_eval \
    --tune_vision true \
    --tune_llm false \
    --use_lora true \
    --q_lora true \
    --model_max_length 2048 \
    --max_steps 10000 \
    --eval_steps 1000 \
    --output_dir output/gnhk_qlora \
    --logging_dir output/gnhk_qlora/logs \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 3 \
    --learning_rate 1e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --gradient_checkpointing true \
    --report_to "tensorboard"
```
My finetune.py edits:
To solve an earlier OutOfMemoryError, I was advised to load the model on the CPU first. I edited the AutoModel.from_pretrained call in finetune/finetune.py to be:
```python
model = AutoModel.from_pretrained(
    model_args.model_name_or_path,
    trust_remote_code=True,
    torch_dtype=compute_dtype,
    device_map={"": "cpu"},
)
```
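If it helps, one direction I am considering (a sketch under my assumptions, not a confirmed fix): keep the 4-bit quantized base model on the GPU instead of the CPU, since the quantized weights should be far smaller than the fp16 model and ought to fit in 12GB. Something like:

```python
import os

import torch
from transformers import AutoModel, BitsAndBytesConfig

# Assumption: load the 4-bit base model straight onto this rank's GPU so
# accelerate's prepare() sees a real device index instead of None.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # QLoRA: 4-bit NF4 base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # matches --fp16 true
    bnb_4bit_use_double_quant=True,
)

model = AutoModel.from_pretrained(
    model_args.model_name_or_path,         # model_args comes from finetune.py
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map={"": local_rank},           # every module on this rank's GPU
)
```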
My Environment (nvidia-smi):
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:01:00.0 Off | N/A |
| 0% 41C P8 18W / 170W | 324MiB / 12288MiB | 0% Default |
+-----------------------------------------+------------------------+----------------------+
```
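For completeness, a quick way to confirm the GPU is visible to PyTorch (a hypothetical check, not output from my run):

```python
import torch

# Quick check that this PyTorch build can actually use the RTX 3060.
print(torch.__version__, torch.version.cuda)  # wheel version and its CUDA toolkit
print(torch.cuda.is_available())              # should be True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))      # "NVIDIA GeForce RTX 3060"
    free, total = torch.cuda.mem_get_info(0)  # bytes of free/total VRAM
    print(f"{free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```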
It seems the accelerate library cannot handle the model being on the CPU during the prepare step.
Could you please provide the correct configuration or code edits to successfully fine-tune with QLoRA on a single 12GB GPU?
Thank you so much for your help.
