Description
🐛 Describe the bug
Hi, I'm encountering a consistent crash when fine-tuning the Qwen2.5-VL-7B-Instruct model with use_liger_kernel=True, using the 2U1/Qwen-VL-Series-Finetune codebase. The crash occurs at step 875, and again at steps 1675 and 2475 when resuming from checkpoints at steps 800 and 1600 respectively, with the following fatal error:
Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension
This error suggests a reference-counting bug in a C extension, potentially within Liger Kernel itself.
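Since the environment uses Python 3.11, where None is still an ordinary reference-counted object, one way to test the refcount hypothesis might be to sample `sys.getrefcount(None)` during training and look for a steady downward trend. The sketch below is only an illustration: `NoneRefcountMonitor` is a hypothetical helper, not part of any library, and the loop stands in for real training steps.

```python
import sys


class NoneRefcountMonitor:
    """Hypothetical helper: track how the refcount of None drifts over time."""

    def __init__(self):
        self.baseline = sys.getrefcount(None)
        self.history = []  # (step, refcount) samples

    def sample(self, step):
        """Record the current refcount of None and return its drift from the baseline."""
        count = sys.getrefcount(None)
        self.history.append((step, count))
        return count - self.baseline


monitor = NoneRefcountMonitor()
for step in range(1, 4):
    # ... one training step would run here ...
    drift = monitor.sample(step)
    print(f"step={step} None refcount drift={drift}")
```

The count fluctuates during normal execution, so only a consistent decline across many steps would be meaningful. On Python 3.12 and later None is immortal, so the value should not move at all.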
Reproduction
- The crash is highly reproducible: every run with `--use_liger_kernel True` fails at step 875.
- The issue appears during the backward pass, as indicated by the stack trace pointing into `torch/autograd` and `torch/utils/checkpoint.py`.
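The crash steps reported above can be compared against the step each run started from. This small calculation assumes the first run started from step 0 (a fresh run); the resume points are the ones stated in the description.

```python
# (step the run started or resumed from, step at which it crashed)
# Assumption: the first run was a fresh start at step 0.
runs = [(0, 875), (800, 1675), (1600, 2475)]

# Steps executed inside each process before the crash.
offsets = [crash - start for start, crash in runs]
print(offsets)  # [875, 875, 875]
```

Every run dies 875 steps after the process starts, regardless of the global step. That pattern would be consistent with something accumulating per step inside the process, rather than with one particular training sample, although it does not prove it.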
Error log (sanitized)
{'loss': 1.3778, 'grad_norm': 0.23937849700450897, 'learning_rate': 8.577619427099825e-05, 'epoch': 0.27}
27%|█████████████████████████████████ | 874/3246 [1:30:51<3:30:41, 5.33s/it]
27%|█████████████████████████████████ | 875/3246 [1:30:56<3:29:14, 5.30s/it]
{'loss': 1.434, 'grad_norm': 0.3069251477718353, 'learning_rate': 8.574131814228779e-05, 'epoch': 0.27}
27%|█████████████████████████████████ | 875/3246 [1:30:56<3:29:14, 5.30s/it]
Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension
Python runtime state: initialized
Current thread 0x00007f9ddc9fc640 (most recent call first):
<no Python frame>
Thread 0x00007f9d689c4640 (most recent call first):
<no Python frame>
Thread 0x00007f9d681c3640 (most recent call first):
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 327 in wait
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007f9d679c2640 (most recent call first):
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 327 in wait
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007f9d671c1640 (most recent call first):
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/selectors.py", line 415 in select
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 948 in wait
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 440 in _poll
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 257 in poll
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 113 in get
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 37 in do_one_step
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 61 in _pin_memory_loop
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007f9aa89a7640 (most recent call first):
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 331 in wait
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 629 in wait
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007f9ee10ee740 (most recent call first):
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/autograd/graph.py", line 824 in _engine_run_backward
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/autograd/__init__.py", line 353 in backward
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/_tensor.py", line 648 in backward
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63 in backward
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2067 in backward
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2126 in backward
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18 in wrapped_fn
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 270 in backward
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/accelerate/accelerator.py", line 2844 in backward
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 4071 in training_step
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 2674 in _inner_training_loop
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 2325 in train
File "/xxx/xxx/xxx/tmp/src/Qwen-VL-Series-Finetune/src/train/train_sft.py", line 265 in train
File "/xxx/xxx/xxx/tmp/src/Qwen-VL-Series-Finetune/src/train/train_sft.py", line 291 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _cffi_backend, yaml._yaml, regex._regex, markupsafe._speedups, PIL._imaging, gmpy2.gmpy2, PIL._imagingft, av._core, av.logging, av.bytesource, av.buffer, av.audio.format, av.error, av.dictionary, av.container.pyio, av.option, av.descriptor, av.format, av.utils, av.stream, av.container.streams, av.sidedata.motionvectors, av.sidedata.sidedata, av.opaque, av.packet, av.container.input, av.container.output, av.container.core, av.codec.context, av.video.format, av.video.reformatter, av.plane, av.video.plane, av.video.frame, av.video.stream, av.codec.hwaccel, av.codec.codec, av.frame, av.audio.layout, av.audio.plane, av.audio.frame, av.audio.stream, av.filter.link, av.filter.context, av.filter.graph, av.filter.filter, av.filter.loudnorm, av.audio.resampler, av.audio.codeccontext, av.audio.fifo, av.bitstream, av.video.codeccontext, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, 
pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, _brotli, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._acero, pyarrow._csv, pyarrow._json, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, ujson, cuda_utils, msgpack._cmsgpack, __triton_launcher (total: 149)
Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension
Python runtime state: initialized
Thread 0x00007fafa71fc640 (most recent call first):
<no Python frame>
Current thread 0x00007faf491c4640 (most recent call first):
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 320 in backward
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/autograd/function.py", line 307 in apply
Thread 0x00007faf449c3640 (most recent call first):
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 327 in wait
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007faf401c2640 (most recent call first):
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 327 in wait
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007faf3f9c1640 (most recent call first):
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/selectors.py", line 415 in select
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 948 in wait
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 440 in _poll
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 257 in poll
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 113 in get
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 37 in do_one_step
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 61 in _pin_memory_loop
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007fabe17fe640 (most recent call first):
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 327 in wait
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007fabe1fff640 (most recent call first):
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/selectors.py", line 415 in select
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 948 in wait
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 440 in _poll
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 257 in poll
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 113 in get
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/tensorboardX/event_file_writer.py", line 202 in run
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007fac67fff640 (most recent call first):
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 331 in wait
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 629 in wait
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007fb0ab8a6740 (most recent call first):
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/autograd/graph.py", line 824 in _engine_run_backward
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/autograd/__init__.py", line 353 in backward
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/_tensor.py", line 648 in backward
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63 in backward
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2067 in backward
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2126 in backward
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18 in wrapped_fn
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 270 in backward
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/accelerate/accelerator.py", line 2844 in backward
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 4071 in training_step
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 2674 in _inner_training_loop
File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 2325 in train
File "/xxx/xxx/xxx/tmp/src/Qwen-VL-Series-Finetune/src/train/train_sft.py", line 265 in train
File "/xxx/xxx/xxx/tmp/src/Qwen-VL-Series-Finetune/src/train/train_sft.py", line 291 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _cffi_backend, yaml._yaml, regex._regex, markupsafe._speedups, PIL._imaging, gmpy2.gmpy2, PIL._imagingft, av._core, av.logging, av.bytesource, av.buffer, av.audio.format, av.error, av.dictionary, av.container.pyio, av.option, av.descriptor, av.format, av.utils, av.stream, av.container.streams, av.sidedata.motionvectors, av.sidedata.sidedata, av.opaque, av.packet, av.container.input, av.container.output, av.container.core, av.codec.context, av.video.format, av.video.reformatter, av.plane, av.video.plane, av.video.frame, av.video.stream, av.codec.hwaccel, av.codec.codec, av.frame, av.audio.layout, av.audio.plane, av.audio.frame, av.audio.stream, av.filter.link, av.filter.context, av.filter.graph, av.filter.filter, av.filter.loudnorm, av.audio.resampler, av.audio.codeccontext, av.audio.fifo, av.bitstream, av.video.codeccontext, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, 
pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, _brotli, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._acero, pyarrow._csv, pyarrow._json, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, ujson, cuda_utils, msgpack._cmsgpack, __triton_launcher (total: 149)
[2025-12-12 11:32:43,783] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1733631
[2025-12-12 11:33:13,808] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1733634
[2025-12-12 11:33:13,814] [ERROR] [launch.py:325:sigkill_handler] ['/xxx/xxx/xxx/tmp/venvs/tf450/bin/python3.11', '-u', 'src/train/train_sft.py', '--local_rank=1', '--use_liger_kernel', 'True', '--lora_enable', 'False', '--use_dora', 'False', '--lora_namespan_exclude', "['lm_head', 'embed_tokens']", '--lora_rank', '32', '--lora_alpha', '64', '--lora_dropout', '0.05', '--num_lora_modules', '-1', '--deepspeed', 'scripts/zero2.json', '--model_id', '/xxx/xxx/xxx/tmp/models/Qwen2.5-VL-7B-Instruct', '--data_path', '/xxx/xxx/xxx/tmp/dataset/finetune/xxx-75k.json', '--image_folder', '/xxx/xxx/xxx/tmp/dataset/finetune/xxx', '--remove_unused_columns', 'False', '--freeze_vision_tower', 'True', '--freeze_llm', 'True', '--freeze_merger', 'False', '--bf16', 'True', '--fp16', 'False', '--disable_flash_attn2', 'False', '--output_dir', '/xxx/xxx/xxx/tmp/ckpts/finetune/model_xxx', '--num_train_epochs', '1', '--per_device_train_batch_size', '96', '--gradient_accumulation_steps', '1', '--image_min_pixels', '200704', '--image_max_pixels', '1003520', '--learning_rate', '1e-4', '--merger_lr', '1e-5', '--vision_lr', '2e-6', '--weight_decay', '0.1', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--gradient_checkpointing', 'True', '--report_to', 'tensorboard', '--lazy_preprocess', 'True', '--save_strategy', 'steps', '--save_steps', '100', '--save_total_limit', '3', '--dataloader_num_workers', '2'] exits with return code = -6
Training script (sanitized)
WORKING_HOME="/xxx/xxx/xxx/tmp"
MODEL_NAME="${WORKING_HOME}/models/Qwen2.5-VL-7B-Instruct"
export PYTHONPATH=src:$PYTHONPATH
GLOBAL_BATCH_SIZE=192
BATCH_PER_DEVICE=96
NUM_DEVICES=2
GRAD_ACCUM_STEPS=$((GLOBAL_BATCH_SIZE / (BATCH_PER_DEVICE * NUM_DEVICES)))
deepspeed --include=localhost:4,5 --master_port=29522 src/train/train_sft.py \
--use_liger_kernel True \
--lora_enable False \
--use_dora False \
--lora_namespan_exclude "['lm_head', 'embed_tokens']" \
--lora_rank 32 \
--lora_alpha 64 \
--lora_dropout 0.05 \
--num_lora_modules -1 \
--deepspeed scripts/zero2.json \
--model_id $MODEL_NAME \
--data_path ${WORKING_HOME}/dataset/finetune/xxx-75k.json \
--image_folder ${WORKING_HOME}/dataset/finetune/xxx \
--remove_unused_columns False \
--freeze_vision_tower True \
--freeze_llm True \
--freeze_merger False \
--bf16 True \
--fp16 False \
--disable_flash_attn2 False \
--output_dir ${WORKING_HOME}/ckpts/finetune/model_xxx \
--num_train_epochs 1 \
--per_device_train_batch_size $BATCH_PER_DEVICE \
--gradient_accumulation_steps $GRAD_ACCUM_STEPS \
--image_min_pixels $((256 * 28 * 28)) \
--image_max_pixels $((1280 * 28 * 28)) \
--learning_rate 1e-4 \
--merger_lr 1e-5 \
--vision_lr 2e-6 \
--weight_decay 0.1 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--gradient_checkpointing True \
--report_to tensorboard \
--lazy_preprocess True \
--save_strategy "steps" \
--save_steps 100 \
--save_total_limit 3 \
--dataloader_num_workers 2
Request
Could you please help investigate whether this issue stems from Liger Kernel's C++/CUDA extensions? Given the error message and reproducibility, it seems likely related to object lifetime/reference counting in a fused kernel (possibly during gradient checkpointing or fused attention/loss operations).
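One way to narrow down which fused operation is involved might be to enable the kernels one at a time and see which configuration still crashes. The sketch below only generates the configurations to try. The flag names are assumptions modelled on the boolean keyword arguments that Liger Kernel's model patch functions appear to accept, so they should be checked against the installed version before use.

```python
# Hypothetical flag names; verify them against the patch function
# in the installed Liger Kernel version before relying on them.
FLAGS = ["rope", "rms_norm", "swiglu", "fused_linear_cross_entropy"]


def one_at_a_time(flags):
    """Yield one config per flag, with that flag enabled and all others disabled."""
    for enabled in flags:
        yield {name: (name == enabled) for name in flags}


configs = list(one_at_a_time(FLAGS))
for cfg in configs:
    print(cfg)
```

Each dictionary could then be passed as keyword arguments to the patch call in a separate run. Because each run here takes about 1.5 hours to reach step 875, one-at-a-time testing keeps the number of runs equal to the number of kernels.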
Let me know if you need:
- Full logs
- Liger Kernel version/commit hash
Thanks for your great work on Liger Kernel!
Reproduce
No response
Versions
Environment
- GPU: 2 × NVIDIA H200 (141 GB VRAM), out of 8 total
- CUDA: 12.8
- Python: 3.11
- PyTorch: 2.7.1+cu128
- Transformers: 4.57.0
- Liger Kernel: 0.6.4
- DeepSpeed: with ZeRO-2 (`zero2.json` config)
- Model: `Qwen2.5-VL-7B-Instruct`
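Exact package versions, including those not listed above, can be collected with the standard library. This is a minimal sketch: the distribution names in `PACKAGES` are assumed to match the names on PyPI and may need adjusting for a local or source install.

```python
from importlib import metadata

# Assumed PyPI distribution names; adjust if your install differs.
PACKAGES = ["torch", "transformers", "liger-kernel", "deepspeed", "accelerate", "triton"]


def collect_versions(packages):
    """Return a mapping of distribution name to installed version."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


for name, version in collect_versions(PACKAGES).items():
    print(f"- {name}: {version}")
```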