It seems that video content is necessary? #237

@nightrain-vampire

Description

Hello, I am running GRPO + LoRA on Qwen3_VL_8B_Instruct, but I ran into the following error:

Traceback (most recent call last):
  File "/mnt/user3/TopicC/Qwen-VL-Series-Finetune/src/train/train_grpo.py", line 285, in <module>
    train()
  File "/mnt/user3/TopicC/Qwen-VL-Series-Finetune/src/train/train_grpo.py", line 259, in train
    trainer.train()
  File "/data/user3/miniconda3/envs/miqa/lib/python3.10/site-packages/transformers/trainer.py", line 2325, in train
    return inner_training_loop(
  File "/data/user3/miniconda3/envs/miqa/lib/python3.10/site-packages/transformers/trainer.py", line 2674, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/data/user3/miniconda3/envs/miqa/lib/python3.10/site-packages/transformers/trainer.py", line 4014, in training_step
    inputs = self._prepare_inputs(inputs)
  File "/data/user3/miniconda3/envs/miqa/lib/python3.10/site-packages/trl/extras/profiling.py", line 98, in wrapper
    return func(self, *args, **kwargs)
  File "/data/user3/miniconda3/envs/miqa/lib/python3.10/site-packages/trl/trainer/grpo_trainer.py", line 1067, in _prepare_inputs
    generation_batch = self._generate_and_score_completions(generation_batch)
  File "/mnt/user3/TopicC/Qwen-VL-Series-Finetune/src/trainer/grpo_trainer.py", line 160, in _generate_and_score_completions
    prompt_inputs = self.processing_class(**processor_kwargs)
  File "/data/user3/miniconda3/envs/miqa/lib/python3.10/site-packages/transformers/models/qwen3_vl/processing_qwen3_vl.py", line 170, in __call__
    videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
  File "/data/user3/miniconda3/envs/miqa/lib/python3.10/site-packages/transformers/video_processing_utils.py", line 206, in __call__
    return self.preprocess(videos, **kwargs)
  File "/data/user3/miniconda3/envs/miqa/lib/python3.10/site-packages/transformers/video_processing_utils.py", line 372, in preprocess
    videos, video_metadata = self._decode_and_sample_videos(
  File "/data/user3/miniconda3/envs/miqa/lib/python3.10/site-packages/transformers/video_processing_utils.py", line 296, in _decode_and_sample_videos
    videos = make_batched_videos(videos)
  File "/data/user3/miniconda3/envs/miqa/lib/python3.10/site-packages/transformers/video_utils.py", line 216, in make_batched_videos
    flat_videos_list = convert_pil_frames_to_video(flat_videos_list)
  File "/data/user3/miniconda3/envs/miqa/lib/python3.10/site-packages/transformers/video_utils.py", line 165, in convert_pil_frames_to_video
    if not (isinstance(videos[0], (list, tuple)) and is_valid_image(videos[0][0])):
IndexError: list index out of range

My training set contains only image data, with no videos at all. However, the code still seems to go down the video-processing path, which triggers the error. How can I resolve this? Or does the trainer currently only support mixed image and video input?
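For what it's worth, the traceback suggests the processor received a `videos` argument that is an empty list rather than `None`: `convert_pil_frames_to_video` indexes `videos[0]`, which raises `IndexError` on `[]`. A minimal sketch of a possible workaround, assuming a hypothetical helper that builds the processor kwargs (the function name and defaults here are illustrative, not from the repo), is to drop empty `images`/`videos` keys before calling the processor:

```python
# Hypothetical sketch: only pass `images`/`videos` to the processor when
# they actually contain data, so an image-only batch never reaches the
# video preprocessing path that indexes videos[0].

def build_processor_kwargs(text, images=None, videos=None):
    """Assemble kwargs for the Qwen3-VL processor, omitting empty modalities."""
    kwargs = {"text": text, "return_tensors": "pt", "padding": True}
    if images:  # skips both None and []
        kwargs["images"] = images
    if videos:  # an empty list here is what triggers the IndexError
        kwargs["videos"] = videos
    return kwargs

# Example: an image-only batch produces no `videos` key at all.
kwargs = build_processor_kwargs(
    text=["describe the image"],
    images=["<PIL image placeholder>"],
    videos=[],
)
print("videos" in kwargs)  # → False
```

If the empty list is being injected by the custom `_generate_and_score_completions` in `src/trainer/grpo_trainer.py` (line 160 in the traceback), a guard like this before `self.processing_class(**processor_kwargs)` may be enough.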
