
[BUG] _get_num_multimodal_tokens: video branch uses undefined a) get_number_of_video_patches, b) merge_size. Tests never hit video route (multiple VLM processors) #43329

@stefgina

Description


System Info

  • transformers version: 4.57.3
  • Platform: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
  • Python version: 3.12.9
  • Huggingface_hub version: 0.34.3
  • Safetensors version: 0.5.3
  • Accelerate version: 1.9.0
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 8
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: True
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • DeepSpeed version: 0.17.4
  • PyTorch version (accelerator?): 2.7.1+cu128 (CUDA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: no
  • GPU type: NVIDIA H100 80GB HBM3

Who can help?

@zucchini-nlp

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

These are pure Python logic bugs; they exist in multiple processors.

  1. The video branch of _get_num_multimodal_tokens calls get_number_of_video_patches(), but that method is not implemented for any of these processors.
  2. The video branch of _get_num_multimodal_tokens uses merge_size without defining it in the video branch.

Affected processors/files:

  • src/transformers/models/ernie4_5_vl_moe/processing_ernie4_5_vl_moe.py
  • src/transformers/models/glm46v/processing_glm46v.py
  • src/transformers/models/glm4v/processing_glm4v.py
  • src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py
  • src/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py
  • src/transformers/models/qwen2_vl/processing_qwen2_vl.py
  • src/transformers/models/qwen3_vl/processing_qwen3_vl.py
  • src/transformers/models/video_llama_3/processing_video_llama_3.py

Minimal repro (any affected processor):

```python
# assumes you have a processor instance for any of the above models
processor._get_num_multimodal_tokens(video_sizes=[(8, 224, 224)])
```

This results in two issues:

1. First issue: AttributeError: 'Qwen3VLVideoProcessor' object has no attribute 'get_number_of_video_patches'. All of the affected processors call this function, but it is not defined. Its image counterpart, get_number_of_image_patches, exists and all of its tests pass; the video branch has no tests at all.

2. Second issue: once get_number_of_video_patches is implemented, the undefined merge_size surfaces next: UnboundLocalError: cannot access local variable 'merge_size' where it is not associated with a value.
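A fix could mirror the image branch. The sketch below is only an illustration: the parameter defaults (patch_size=14, temporal_patch_size=2, merge_size=2) and the 3D patch-grid math are assumptions modeled on Qwen2-VL-style processors; the real helper would read these values from the video processor config.

```python
# Sketch of what a fixed video branch could compute. The defaults
# (patch_size=14, temporal_patch_size=2, merge_size=2) are assumptions,
# not the actual values from any transformers processor.

def num_video_tokens(video_sizes, patch_size=14, temporal_patch_size=2, merge_size=2):
    """Token count per video from (num_frames, height, width) size tuples."""
    counts = []
    for frames, height, width in video_sizes:
        # 3D patch grid; ceil division so partial patches still count
        grid_t = -(-frames // temporal_patch_size)
        grid_h = -(-height // patch_size)
        grid_w = -(-width // patch_size)
        # spatial merging collapses merge_size x merge_size patches per token
        counts.append(grid_t * grid_h * grid_w // merge_size**2)
    return counts
```

With the repro's input, `num_video_tokens([(8, 224, 224)])` gives `[256]` under these assumed defaults.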

Expected behavior

Calling _get_num_multimodal_tokens(video_sizes=...) should not crash.

The video branch should define merge_size (just as the image branch does) and call a get_number_of_video_patches helper (analogous to get_number_of_image_patches in the image branch).

I also think it would be a good idea to add a unit test that calls _get_num_multimodal_tokens(video_sizes=[...]) so that CI actually exercises the video path.
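Such a test could look roughly like the sketch below. StubProcessor stands in for a real processor fixture, and its attribute names and the returned dict shape are assumptions modeled on the image branch described above, not the actual transformers API:

```python
# Sketch of a unit test for the video route. StubProcessor and the returned
# dict shape are assumptions; a real test would use an actual processor
# instance from the affected models.

class StubProcessor:
    patch_size = 14
    temporal_patch_size = 2
    merge_size = 2

    def _get_num_multimodal_tokens(self, image_sizes=None, video_sizes=None):
        out = {}
        if video_sizes is not None:
            merge_size = self.merge_size  # the fix: bind it on the video path too
            out["num_video_tokens"] = [
                (-(-t // self.temporal_patch_size))
                * (-(-h // self.patch_size))
                * (-(-w // self.patch_size))
                // merge_size**2
                for t, h, w in video_sizes
            ]
        return out


def test_video_route_does_not_crash():
    # this call raised AttributeError / UnboundLocalError before the fix
    result = StubProcessor()._get_num_multimodal_tokens(video_sizes=[(8, 224, 224)])
    assert result["num_video_tokens"] == [256]

test_video_route_does_not_crash()
```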


I’m happy to send a PR with the fixes + tests.
