### Description

### System Info

- `transformers` version: 4.57.3
- Platform: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
- Python version: 3.12.9
- Huggingface_hub version: 0.34.3
- Safetensors version: 0.5.3
- Accelerate version: 1.9.0
- Accelerate config: - compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 8
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: True
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- DeepSpeed version: 0.17.4
- PyTorch version (accelerator?): 2.7.1+cu128 (CUDA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: no
- Using GPU in script?: no
- GPU type: NVIDIA H100 80GB HBM3
### Who can help?

### Information
- The official example scripts
- My own modified scripts
### Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
### Reproduction
These are pure Python logic bugs, and they exist in multiple processors:
- The video branch of `_get_num_multimodal_tokens` calls `get_number_of_video_patches()`, but it is not implemented for any of these processors.
- The video branch of `_get_num_multimodal_tokens` uses `merge_size` without defining it in that branch.
Affected processors/files:
- `src/transformers/models/ernie4_5_vl_moe/processing_ernie4_5_vl_moe.py`
- `src/transformers/models/glm46v/processing_glm46v.py`
- `src/transformers/models/glm4v/processing_glm4v.py`
- `src/transformers/models/qwen2_5_vl/modular_qwen2_5_vl.py`
- `src/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py`
- `src/transformers/models/qwen2_vl/processing_qwen2_vl.py`
- `src/transformers/models/qwen3_vl/processing_qwen3_vl.py`
- `src/transformers/models/video_llama_3/processing_video_llama_3.py`
Minimal repro (any affected processor):
```python
# assumes you have a processor instance for any of the above models
processor._get_num_multimodal_tokens(video_sizes=[(8, 224, 224)])
```

This call hits two issues:
- First issue: `AttributeError: 'Qwen3VLVideoProcessor' object has no attribute 'get_number_of_video_patches'`. All of the affected processors call this function, but it is not defined anywhere. In contrast, the image counterpart `get_number_of_image_patches` exists and all of its tests pass; the video branch has no tests at all.
- Second issue: after you fix the first by implementing `get_number_of_video_patches`, the undefined `merge_size` surfaces next: `UnboundLocalError: cannot access local variable 'merge_size' where it is not associated with a value`.
### Expected behavior
Calling `_get_num_multimodal_tokens(video_sizes=...)` should not crash.
The video branch should define `merge_size` and implement `get_number_of_video_patches`, mirroring what the image branch already does.
It would also be worth adding a unit test that calls `_get_num_multimodal_tokens(video_sizes=[...])` so CI actually exercises the video path.
I’m happy to send a PR with the fixes + tests.