Fatal Python error when using Liger Kernel with Qwen-VL fine-tuning: `none_dealloc: deallocating None` at step ~875

### 🐛 Describe the bug

Hi, I'm encountering a consistent crash when enabling `use_liger_kernel=True` during fine-tuning of the **Qwen2.5-VL-7B-Instruct** model using the [2U1/Qwen-VL-Series-Finetune](https://github.com/2U1/Qwen-VL-Series-Finetune) codebase. The crash occurs **at step 875** (and again at steps 1675 and 2475 when resuming from checkpoints at steps 800 and 1600, respectively), with the following fatal error:

```
Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension
```

This error suggests a reference-counting bug in a C extension, potentially within Liger Kernel itself.
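One way to test the refcount hypothesis is to watch the reference count of `None` while training runs. On CPython 3.11 (the version used here) `None` is an ordinary refcounted object, so a count that falls steadily from step to step would point at a missing `Py_INCREF` somewhere in native code. The helper below is a minimal sketch, not part of any library: the class name `NoneRefcountMonitor` and its methods are hypothetical, and it would need to be called once per step, for example from a `TrainerCallback.on_step_end` hook.

```python
import sys


class NoneRefcountMonitor:
    """Track sys.getrefcount(None) across training steps (hypothetical helper)."""

    def __init__(self):
        self.samples = []  # list of (step, refcount) pairs

    def record(self, step):
        # On CPython <= 3.11, None is refcounted like any other object, so a
        # steady decline here hints at an unbalanced decref in native code.
        # On CPython >= 3.12, None is immortal and this value stays constant.
        self.samples.append((step, sys.getrefcount(None)))

    def drift_per_step(self):
        """Average refcount change per step between the first and last sample."""
        if len(self.samples) < 2:
            return 0.0
        (s0, r0), (s1, r1) = self.samples[0], self.samples[-1]
        return (r1 - r0) / (s1 - s0) if s1 != s0 else 0.0


# Usage sketch: record once per step, then inspect the trend.
monitor = NoneRefcountMonitor()
for step in range(5):
    monitor.record(step)
print(monitor.samples[-1], monitor.drift_per_step())
```

A clearly negative drift with `--use_liger_kernel True` and a flat one with it disabled would narrow the search considerably; a flat count in both cases would suggest looking elsewhere.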

### Reproduction
- The crash is **highly reproducible**: every run with `--use_liger_kernel True` fails at **step 875**.
- The issue appears during the **backward pass**, as indicated by the stack trace pointing into `torch/autograd` and `torch/utils/checkpoint.py`.
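
The reported crash steps can be compared against the step each run started from. The short calculation below uses only the numbers given in this report, with one assumption: that the first run started from scratch at step 0.

```python
# Crash steps and the step each run started from, as reported in this issue.
# Assumption: the first run started from scratch (step 0).
runs = [
    {"start_step": 0, "crash_step": 875},
    {"start_step": 800, "crash_step": 1675},
    {"start_step": 1600, "crash_step": 2475},
]

# Number of steps each process executed before it crashed.
offsets = [run["crash_step"] - run["start_step"] for run in runs]
print(offsets)  # [875, 875, 875]
```

Every run dies 875 steps after its process starts, which would be consistent with something accumulating per step (such as a drifting refcount) rather than with one specific bad sample in the dataset.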

<details>
<summary>Click to expand error log (sanitized)</summary>

``` 
{'loss': 1.3778, 'grad_norm': 0.23937849700450897, 'learning_rate': 8.577619427099825e-05, 'epoch': 0.27}

 27%|████████████████████████████████▎                                                                                       | 874/3246 [1:30:51<3:30:41,  5.33s/it]
 27%|████████████████████████████████▎                                                                                       | 875/3246 [1:30:56<3:29:14,  5.30s/it]
                                                                                                                                                                    
{'loss': 1.434, 'grad_norm': 0.3069251477718353, 'learning_rate': 8.574131814228779e-05, 'epoch': 0.27}

 27%|████████████████████████████████▎                                                                                       | 875/3246 [1:30:56<3:29:14,  5.30s/it]Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension
Python runtime state: initialized

Current thread 0x00007f9ddc9fc640 (most recent call first):
  <no Python frame>

Thread 0x00007f9d689c4640 (most recent call first):
  <no Python frame>

Thread 0x00007f9d681c3640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 327 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007f9d679c2640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 327 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007f9d671c1640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/selectors.py", line 415 in select
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 948 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 440 in _poll
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 257 in poll
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 113 in get
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 37 in do_one_step
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 61 in _pin_memory_loop
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007f9aa89a7640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 331 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 629 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007f9ee10ee740 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/autograd/graph.py", line 824 in _engine_run_backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/autograd/__init__.py", line 353 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/_tensor.py", line 648 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2067 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2126 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18 in wrapped_fn
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 270 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/accelerate/accelerator.py", line 2844 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 4071 in training_step
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 2674 in _inner_training_loop
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 2325 in train
  File "/xxx/xxx/xxx/tmp/src/Qwen-VL-Series-Finetune/src/train/train_sft.py", line 265 in train
  File "/xxx/xxx/xxx/tmp/src/Qwen-VL-Series-Finetune/src/train/train_sft.py", line 291 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _cffi_backend, yaml._yaml, regex._regex, markupsafe._speedups, PIL._imaging, gmpy2.gmpy2, PIL._imagingft, av._core, av.logging, av.bytesource, av.buffer, av.audio.format, av.error, av.dictionary, av.container.pyio, av.option, av.descriptor, av.format, av.utils, av.stream, av.container.streams, av.sidedata.motionvectors, av.sidedata.sidedata, av.opaque, av.packet, av.container.input, av.container.output, av.container.core, av.codec.context, av.video.format, av.video.reformatter, av.plane, av.video.plane, av.video.frame, av.video.stream, av.codec.hwaccel, av.codec.codec, av.frame, av.audio.layout, av.audio.plane, av.audio.frame, av.audio.stream, av.filter.link, av.filter.context, av.filter.graph, av.filter.filter, av.filter.loudnorm, av.audio.resampler, av.audio.codeccontext, av.audio.fifo, av.bitstream, av.video.codeccontext, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, 
pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, _brotli, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._acero, pyarrow._csv, pyarrow._json, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, ujson, cuda_utils, msgpack._cmsgpack, __triton_launcher (total: 149)
Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension
Python runtime state: initialized

Thread 0x00007fafa71fc640 (most recent call first):
  <no Python frame>

Current thread 0x00007faf491c4640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 320 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/autograd/function.py", line 307 in apply

Thread 0x00007faf449c3640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 327 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007faf401c2640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 327 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007faf3f9c1640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/selectors.py", line 415 in select
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 948 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 440 in _poll
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 257 in poll
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 113 in get
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 37 in do_one_step
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 61 in _pin_memory_loop
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007fabe17fe640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 327 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 982 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007fabe1fff640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/selectors.py", line 415 in select
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 948 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 440 in _poll
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/connection.py", line 257 in poll
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/multiprocessing/queues.py", line 113 in get
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/tensorboardX/event_file_writer.py", line 202 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007fac67fff640 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 331 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 629 in wait
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007fb0ab8a6740 (most recent call first):
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/autograd/graph.py", line 824 in _engine_run_backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/autograd/__init__.py", line 353 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/torch/_tensor.py", line 648 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2067 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2126 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18 in wrapped_fn
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 270 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/accelerate/accelerator.py", line 2844 in backward
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 4071 in training_step
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 2674 in _inner_training_loop
  File "/xxx/xxx/xxx/tmp/venvs/tf450/lib/python3.11/site-packages/transformers/trainer.py", line 2325 in train
  File "/xxx/xxx/xxx/tmp/src/Qwen-VL-Series-Finetune/src/train/train_sft.py", line 265 in train
  File "/xxx/xxx/xxx/tmp/src/Qwen-VL-Series-Finetune/src/train/train_sft.py", line 291 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, _cffi_backend, yaml._yaml, regex._regex, markupsafe._speedups, PIL._imaging, gmpy2.gmpy2, PIL._imagingft, av._core, av.logging, av.bytesource, av.buffer, av.audio.format, av.error, av.dictionary, av.container.pyio, av.option, av.descriptor, av.format, av.utils, av.stream, av.container.streams, av.sidedata.motionvectors, av.sidedata.sidedata, av.opaque, av.packet, av.container.input, av.container.output, av.container.core, av.codec.context, av.video.format, av.video.reformatter, av.plane, av.video.plane, av.video.frame, av.video.stream, av.codec.hwaccel, av.codec.codec, av.frame, av.audio.layout, av.audio.plane, av.audio.frame, av.audio.stream, av.filter.link, av.filter.context, av.filter.graph, av.filter.filter, av.filter.loudnorm, av.audio.resampler, av.audio.codeccontext, av.audio.fifo, av.bitstream, av.video.codeccontext, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, 
pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, _brotli, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._acero, pyarrow._csv, pyarrow._json, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, ujson, cuda_utils, msgpack._cmsgpack, __triton_launcher (total: 149)
[2025-12-12 11:32:43,783] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1733631
[2025-12-12 11:33:13,808] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1733634
[2025-12-12 11:33:13,814] [ERROR] [launch.py:325:sigkill_handler] ['/xxx/xxx/xxx/tmp/venvs/tf450/bin/python3.11', '-u', 'src/train/train_sft.py', '--local_rank=1', '--use_liger_kernel', 'True', '--lora_enable', 'False', '--use_dora', 'False', '--lora_namespan_exclude', "['lm_head', 'embed_tokens']", '--lora_rank', '32', '--lora_alpha', '64', '--lora_dropout', '0.05', '--num_lora_modules', '-1', '--deepspeed', 'scripts/zero2.json', '--model_id', '/xxx/xxx/xxx/tmp/models/Qwen2.5-VL-7B-Instruct', '--data_path', '/xxx/xxx/xxx/tmp/dataset/finetune/xxx-75k.json', '--image_folder', '/xxx/xxx/xxx/tmp/dataset/finetune/xxx', '--remove_unused_columns', 'False', '--freeze_vision_tower', 'True', '--freeze_llm', 'True', '--freeze_merger', 'False', '--bf16', 'True', '--fp16', 'False', '--disable_flash_attn2', 'False', '--output_dir', '/xxx/xxx/xxx/tmp/ckpts/finetune/model_xxx', '--num_train_epochs', '1', '--per_device_train_batch_size', '96', '--gradient_accumulation_steps', '1', '--image_min_pixels', '200704', '--image_max_pixels', '1003520', '--learning_rate', '1e-4', '--merger_lr', '1e-5', '--vision_lr', '2e-6', '--weight_decay', '0.1', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--gradient_checkpointing', 'True', '--report_to', 'tensorboard', '--lazy_preprocess', 'True', '--save_strategy', 'steps', '--save_steps', '100', '--save_total_limit', '3', '--dataloader_num_workers', '2'] exits with return code = -6
```
</details>


<details>
<summary>Click to expand training script (sanitized)</summary>

```bash
WORKING_HOME="/xxx/xxx/xxx/tmp"
MODEL_NAME="${WORKING_HOME}/models/Qwen2.5-VL-7B-Instruct"

export PYTHONPATH=src:$PYTHONPATH

GLOBAL_BATCH_SIZE=192
BATCH_PER_DEVICE=96
NUM_DEVICES=2
GRAD_ACCUM_STEPS=$((GLOBAL_BATCH_SIZE / (BATCH_PER_DEVICE * NUM_DEVICES)))

deepspeed --include=localhost:4,5 --master_port=29522 src/train/train_sft.py \
    --use_liger_kernel True \
    --lora_enable False \
    --use_dora False \
    --lora_namespan_exclude "['lm_head', 'embed_tokens']" \
    --lora_rank 32 \
    --lora_alpha 64 \
    --lora_dropout 0.05 \
    --num_lora_modules -1 \
    --deepspeed scripts/zero2.json \
    --model_id $MODEL_NAME \
    --data_path ${WORKING_HOME}/dataset/finetune/xxx-75k.json \
    --image_folder ${WORKING_HOME}/dataset/finetune/xxx \
    --remove_unused_columns False \
    --freeze_vision_tower True \
    --freeze_llm True \
    --freeze_merger False \
    --bf16 True \
    --fp16 False \
    --disable_flash_attn2 False \
    --output_dir ${WORKING_HOME}/ckpts/finetune/model_xxx \
    --num_train_epochs 1 \
    --per_device_train_batch_size $BATCH_PER_DEVICE \
    --gradient_accumulation_steps $GRAD_ACCUM_STEPS \
    --image_min_pixels $((256 * 28 * 28)) \
    --image_max_pixels $((1280 * 28 * 28)) \
    --learning_rate 1e-4 \
    --merger_lr 1e-5 \
    --vision_lr 2e-6 \
    --weight_decay 0.1 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --gradient_checkpointing True \
    --report_to tensorboard \
    --lazy_preprocess True \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 3 \
    --dataloader_num_workers 2
```
</details>

### Request
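Could you please help investigate whether this issue stems from Liger Kernel? As far as I understand, Liger's kernels are written in Triton rather than as hand-written C++/CUDA extensions, so the compiled Triton helper modules listed in the log (`__triton_launcher`, `cuda_utils`) may be the relevant native code. Given the error message and reproducibility, the crash seems likely to be related to object lifetime or reference counting around a fused kernel (possibly during gradient checkpointing or fused attention/loss operations).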

Let me know if you need:
- Full logs
- Liger Kernel version/commit hash

Thanks for your great work on Liger Kernel!

### Reproduce

_No response_

### Versions

### Environment
- **GPU**: 2 × NVIDIA H200 (141GB VRAM), out of 8 total  
- **CUDA**: 12.8  
- **Python**: 3.11  
- **PyTorch**: 2.7.1+cu128  
- **Transformers**: 4.57.0   
- **Liger Kernel**: 0.6.4
- **DeepSpeed**: with ZeRO-2 (`zero2.json` config)  
- **Model**: `Qwen2.5-VL-7B-Instruct`  
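
The list above does not include exact versions for DeepSpeed, Accelerate, or Triton, which may matter for this crash. A minimal sketch for collecting them with the standard library follows; the distribution names in `PACKAGES` are assumptions about how the packages are published on PyPI and may need adjusting for a source or nightly install.

```python
from importlib import metadata

# Distribution names as assumed to be published on PyPI; adjust if needed.
PACKAGES = ["torch", "transformers", "liger-kernel", "deepspeed", "accelerate", "triton"]


def collect_versions(packages=PACKAGES):
    """Return {package: installed version or 'not installed'}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is absent or installed under a different distribution name.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Running this inside the same virtual environment as the training job gives a version list that can be pasted directly into the report.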
