Fatal Python error when using Liger Kernel with Qwen-VL fine-tuning: none_dealloc: deallocating None at step ~875 #975

@lxy21214

🐛 Describe the bug

Hi, I'm encountering a consistent crash when enabling use_liger_kernel=True while fine-tuning the Qwen2.5-VL-7B-Instruct model with the 2U1/Qwen-VL-Series-Finetune codebase. The crash occurs around step 875 in a fresh run, and again around steps 1675 and 2475 when resuming from checkpoints at steps 800 and 1600 respectively. The fatal error is:

Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension

This error suggests a reference-counting bug in a C extension, potentially within Liger Kernel itself.
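A minimal sketch for checking the refcount hypothesis, assuming Python 3.11 as used here (from Python 3.12 onward None is immortal, so its reported refcount no longer changes). The helper name `none_refcount_drift` and its `step_fn` argument are hypothetical and are not part of Liger Kernel, PyTorch, or the training codebase:

```python
import sys


def none_refcount_drift(step_fn, steps=5):
    """Call step_fn repeatedly and record how None's refcount changes per call."""
    deltas = []
    for _ in range(steps):
        before = sys.getrefcount(None)
        step_fn()  # hypothetical stand-in for one training step
        deltas.append(sys.getrefcount(None) - before)
    return deltas


if __name__ == "__main__":
    # A no-op step, only to show the call shape.
    print(none_refcount_drift(lambda: None, steps=3))
```

If a real training step were passed as `step_fn`, a steadily negative drift would be consistent with None's refcount eventually reaching zero after a roughly fixed number of steps. It would not by itself identify which extension is responsible.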

Reproduction

  • The crash is highly reproducible: every run with --use_liger_kernel True fails at step 875.
  • The issue appears during the backward pass, as indicated by the stack trace pointing into torch/autograd and torch/utils/checkpoint.py.
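Taking the reported step numbers at face value (they are approximate), the crash arrives the same number of steps after each process start, whether the run is fresh or resumed. A quick check of that arithmetic:

```python
# (step at which the process started, step at which it crashed), as reported
runs = [(0, 875), (800, 1675), (1600, 2475)]

steps_before_crash = [crashed - started for started, crashed in runs]
print(steps_before_crash)  # [875, 875, 875]
```

A constant interval like this would fit something that accumulates once per step within a process, rather than a particular batch in the dataset, though this is an inference from three data points.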
Error log (sanitized)
{'loss': 1.3778, 'grad_norm': 0.23937849700450897, 'learning_rate': 8.577619427099825e-05, 'epoch': 0.27}

 27%|████████████████████████████████▎                                                                                       | 874/3246 [1:30:51<3:30:41,  5.33s/it]
 27%|████████████████████████████████▎                                                                                       | 875/3246 [1:30:56<3:29:14,  5.30s/it]
{'loss': 1.434, 'grad_norm': 0.3069251477718353, 'learning_rate': 8.574131814228779e-05, 'epoch': 0.27}

 27%|████████████████████████████████▎                                                                                       | 875/3246 [1:30:56<3:29:14,  5.30s/it]
Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension
Python runtime state: initialized

Current thread 0x00007f9ddc9fc640 (most recent call first):
  <no Python frame>

Thread 0x00007f9d689c4640 (most recent call first):
  <no Python frame>

Thread 0x00007f9d681c3640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 327 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007f9d679c2640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 327 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007f9d671c1640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/selectors.py", line 415 in select
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 948 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 440 in _poll
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 257 in poll
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 113 in get
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 37 in do_one_step
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 61 in _pin_memory_loop
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007f9aa89a7640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 331 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 629 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007f9ee10ee740 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/autograd/graph.py", line 824 in _engine_run_backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/autograd/__init__.py", line 353 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/_tensor.py", line 648 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2067 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2126 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18 in wrapped_fn
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 270 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/accelerate/accelerator.py", line 2844 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 4071 in training_step
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 2674 in _inner_training_loop
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 2325 in train
  File "/xxx/xxx/xxx/tmp/src/Qwen-VL-Series-Finetune/src/train/train_sft.py", line 265 in train
  File "/xxx/xxx/xxx/tmp/src/Qwen-VL-Series-Finetune/src/train/train_sft.py", line 291 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _cffi_backend, yaml._yaml, regex._regex, markupsafe._speedups, PIL._imaging, gmpy2.gmpy2, PIL._imagingft, av._core, av.logging, av.bytesource, av.buffer, av.audio.format, av.error, av.dictionary, av.container.pyio, av.option, av.descriptor, av.format, av.utils, av.stream, av.container.streams, av.sidedata.motionvectors, av.sidedata.sidedata, av.opaque, av.packet, av.container.input, av.container.output, av.container.core, av.codec.context, av.video.format, av.video.reformatter, av.plane, av.video.plane, av.video.frame, av.video.stream, av.codec.hwaccel, av.codec.codec, av.frame, av.audio.layout, av.audio.plane, av.audio.frame, av.audio.stream, av.filter.link, av.filter.context, av.filter.graph, av.filter.filter, av.filter.loudnorm, av.audio.resampler, av.audio.codeccontext, av.audio.fifo, av.bitstream, av.video.codeccontext, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, 
pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, _brotli, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._acero, pyarrow._csv, pyarrow._json, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, ujson, cuda_utils, msgpack._cmsgpack, __triton_launcher (total: 149)
Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension
Python runtime state: initialized

Thread 0x00007fafa71fc640 (most recent call first):
  <no Python frame>

Current thread 0x00007faf491c4640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 320 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/autograd/function.py", line 307 in apply

Thread 0x00007faf449c3640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 327 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007faf401c2640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 327 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007faf3f9c1640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/selectors.py", line 415 in select
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 948 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 440 in _poll
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 257 in poll
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 113 in get
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 37 in do_one_step
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 61 in _pin_memory_loop
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007fabe17fe640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 327 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007fabe1fff640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/selectors.py", line 415 in select
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 948 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 440 in _poll
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 257 in poll
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 113 in get
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/tensorboardX/event_file_writer.py", line 202 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007fac67fff640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 331 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 629 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007fb0ab8a6740 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/autograd/graph.py", line 824 in _engine_run_backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/autograd/__init__.py", line 353 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/_tensor.py", line 648 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2067 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2126 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18 in wrapped_fn
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 270 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/accelerate/accelerator.py", line 2844 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 4071 in training_step
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 2674 in _inner_training_loop
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 2325 in train
  File "/xxx/xxx/xxx/tmp/src/Qwen-VL-Series-Finetune/src/train/train_sft.py", line 265 in train
  File "/xxx/xxx/xxx/tmp/src/Qwen-VL-Series-Finetune/src/train/train_sft.py", line 291 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _cffi_backend, yaml._yaml, regex._regex, markupsafe._speedups, PIL._imaging, gmpy2.gmpy2, PIL._imagingft, av._core, av.logging, av.bytesource, av.buffer, av.audio.format, av.error, av.dictionary, av.container.pyio, av.option, av.descriptor, av.format, av.utils, av.stream, av.container.streams, av.sidedata.motionvectors, av.sidedata.sidedata, av.opaque, av.packet, av.container.input, av.container.output, av.container.core, av.codec.context, av.video.format, av.video.reformatter, av.plane, av.video.plane, av.video.frame, av.video.stream, av.codec.hwaccel, av.codec.codec, av.frame, av.audio.layout, av.audio.plane, av.audio.frame, av.audio.stream, av.filter.link, av.filter.context, av.filter.graph, av.filter.filter, av.filter.loudnorm, av.audio.resampler, av.audio.codeccontext, av.audio.fifo, av.bitstream, av.video.codeccontext, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, 
pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, _brotli, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._acero, pyarrow._csv, pyarrow._json, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, ujson, cuda_utils, msgpack._cmsgpack, __triton_launcher (total: 149)
[2025-12-12 11:32:43,783] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1733631
[2025-12-12 11:33:13,808] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1733634
[2025-12-12 11:33:13,814] [ERROR] [launch.py:325:sigkill_handler] ['/xxx/xxx/xxx/tmp/venvs/tf450/bin/python3.11', '-u', 'src/train/train_sft.py', '--local_rank=1', '--use_liger_kernel', 'True', '--lora_enable', 'False', '--use_dora', 'False', '--lora_namespan_exclude', "['lm_head', 'embed_tokens']", '--lora_rank', '32', '--lora_alpha', '64', '--lora_dropout', '0.05', '--num_lora_modules', '-1', '--deepspeed', 'scripts/zero2.json', '--model_id', '/xxx/xxx/xxx/tmp/models/Qwen2.5-VL-7B-Instruct', '--data_path', '/xxx/xxx/xxx/tmp/dataset/finetune/xxx-75k.json', '--image_folder', '/xxx/xxx/xxx/tmp/dataset/finetune/xxx', '--remove_unused_columns', 'False', '--freeze_vision_tower', 'True', '--freeze_llm', 'True', '--freeze_merger', 'False', '--bf16', 'True', '--fp16', 'False', '--disable_flash_attn2', 'False', '--output_dir', '/xxx/xxx/xxx/tmp/ckpts/finetune/model_xxx', '--num_train_epochs', '1', '--per_device_train_batch_size', '96', '--gradient_accumulation_steps', '1', '--image_min_pixels', '200704', '--image_max_pixels', '1003520', '--learning_rate', '1e-4', '--merger_lr', '1e-5', '--vision_lr', '2e-6', '--weight_decay', '0.1', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--gradient_checkpointing', 'True', '--report_to', 'tensorboard', '--lazy_preprocess', 'True', '--save_strategy', 'steps', '--save_steps', '100', '--save_total_limit', '3', '--dataloader_num_workers', '2'] exits with return code = -6
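The launcher's "return code = -6" follows the Python subprocess convention in which a negative value -N means the child was terminated by signal N. A small sketch for decoding it, assuming a POSIX system such as Linux; `describe_exit` is a hypothetical helper name:

```python
import signal


def describe_exit(returncode: int) -> str:
    """Name the signal behind a negative subprocess-style return code."""
    if returncode < 0:
        return signal.Signals(-returncode).name
    return f"exit status {returncode}"


print(describe_exit(-6))
```

On Linux, signal 6 is SIGABRT, which matches the interpreter aborting the process after reporting a fatal error.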
Training script (sanitized)
WORKING_HOME="/xxx/xxx/xxx/tmp"
MODEL_NAME="${WORKING_HOME}/models/Qwen2.5-VL-7B-Instruct"

export PYTHONPATH=src:$PYTHONPATH

GLOBAL_BATCH_SIZE=192
BATCH_PER_DEVICE=96
NUM_DEVICES=2
GRAD_ACCUM_STEPS=$((GLOBAL_BATCH_SIZE / (BATCH_PER_DEVICE * NUM_DEVICES)))

deepspeed --include=localhost:4,5 --master_port=29522 src/train/train_sft.py \
    --use_liger_kernel True \
    --lora_enable False \
    --use_dora False \
    --lora_namespan_exclude "['lm_head', 'embed_tokens']" \
    --lora_rank 32 \
    --lora_alpha 64 \
    --lora_dropout 0.05 \
    --num_lora_modules -1 \
    --deepspeed scripts/zero2.json \
    --model_id $MODEL_NAME \
    --data_path ${WORKING_HOME}/dataset/finetune/xxx-75k.json \
    --image_folder ${WORKING_HOME}/dataset/finetune/xxx \
    --remove_unused_columns False \
    --freeze_vision_tower True \
    --freeze_llm True \
    --freeze_merger False \
    --bf16 True \
    --fp16 False \
    --disable_flash_attn2 False \
    --output_dir ${WORKING_HOME}/ckpts/finetune/model_xxx \
    --num_train_epochs 1 \
    --per_device_train_batch_size $BATCH_PER_DEVICE \
    --gradient_accumulation_steps $GRAD_ACCUM_STEPS \
    --image_min_pixels $((256 * 28 * 28)) \
    --image_max_pixels $((1280 * 28 * 28)) \
    --learning_rate 1e-4 \
    --merger_lr 1e-5 \
    --vision_lr 2e-6 \
    --weight_decay 0.1 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --gradient_checkpointing True \
    --report_to tensorboard \
    --lazy_preprocess True \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 3 \
    --dataloader_num_workers 2
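
As a cross-check, the shell arithmetic in the script can be reproduced in Python and compared with the expanded arguments shown in the error log. The variable names mirror the script; nothing here touches the training code:

```python
GLOBAL_BATCH_SIZE = 192
BATCH_PER_DEVICE = 96
NUM_DEVICES = 2

# Same integer division as $((GLOBAL_BATCH_SIZE / (BATCH_PER_DEVICE * NUM_DEVICES)))
grad_accum_steps = GLOBAL_BATCH_SIZE // (BATCH_PER_DEVICE * NUM_DEVICES)

# Same products as $((256 * 28 * 28)) and $((1280 * 28 * 28))
image_min_pixels = 256 * 28 * 28
image_max_pixels = 1280 * 28 * 28

print(grad_accum_steps, image_min_pixels, image_max_pixels)  # 1 200704 1003520
```

These values agree with the '--gradient_accumulation_steps', '--image_min_pixels' and '--image_max_pixels' arguments in the launcher's error line.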

Request

Could you please help investigate whether this issue stems from Liger Kernel's C++/CUDA extensions? Given the error message and its reproducibility, it seems likely to be related to object lifetime or reference counting in a fused kernel (possibly during gradient checkpointing or in fused attention/loss operations).

Let me know if you need:

  • Full logs
  • Liger Kernel version/commit hash

Thanks for your great work on Liger Kernel!

Reproduce

No response

Versions

Environment

  • GPU: 2 × NVIDIA H200 (141 GB VRAM), out of 8 total
  • CUDA: 12.8
  • Python: 3.11
  • PyTorch: 2.7.1+cu128
  • Transformers: 4.57.0
  • Liger Kernel: 0.6.4
  • DeepSpeed: with ZeRO-2 (zero2.json config)
  • Model: Qwen2.5-VL-7B-Instruct
