
Fix IDs shape mismatch in SFT for VLMs with text-only#5354

Merged
albertvillanova merged 7 commits into huggingface:main from albertvillanova:fix-5334
Mar 24, 2026

Conversation

@albertvillanova
Member

@albertvillanova albertvillanova commented Mar 23, 2026

Fix IDs shape mismatch in SFT for VLMs with text-only.

Fix #5334.

This PR fixes a regression when training vision-language models (VLMs) on text-only datasets, ensuring compatibility between data preprocessing and model expectations. The main focus is fixing how input IDs are handled for VLMs, plus a regression test to prevent future breakage.

Changes

Bug fix for VLM text-only input handling:

  • Fixed an inconsistency in tokenize_fn where VLM processors returned input IDs as a list of lists (e.g., [[1, 2, 3]]) instead of a flat list (e.g., [1, 2, 3]). The function now unwraps the extra list level to prevent downstream shape errors in models expecting 3-D position IDs.
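The unwrapping step can be sketched as a small helper. This is an illustrative reconstruction of the behavior described above, not the actual code in TRL's `tokenize_fn`; the helper name `unwrap_input_ids` is hypothetical.

```python
# Hypothetical sketch of the shape normalization described above.
# The real fix lives inside TRL's tokenize_fn; names here are illustrative.

def unwrap_input_ids(output: dict) -> dict:
    """Unwrap a spurious batch level: [[1, 2, 3]] -> [1, 2, 3].

    Some VLM processors return input_ids with a batch dimension even for a
    single example, unlike plain tokenizers on the LLM code path.
    """
    ids = output["input_ids"]
    if len(ids) == 1 and isinstance(ids[0], list):
        output["input_ids"] = ids[0]
    return output

print(unwrap_input_ids({"input_ids": [[1, 2, 3]]}))  # → {'input_ids': [1, 2, 3]}
print(unwrap_input_ids({"input_ids": [1, 2, 3]}))    # already flat, unchanged
```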

Testing improvements:

  • Added a regression test in test_sft_trainer.py to verify that training a VLM with a text-only dataset works correctly and does not produce shape errors.

Note

Medium Risk
Touches SFT dataset tokenization for standard (non-conversational) examples; a small shape-normalization change could affect any processor that returns nested input_ids, but it is guarded by a targeted regression test.

Overview
Fixes a regression when training vision-language models on text-only standard datasets by normalizing input_ids returned from VLM processing_class calls (unwrapping [[...]] to [...]) to match the LLM code path and avoid downstream shape/position-id errors.
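To see why the extra nesting causes downstream shape errors, consider what happens at collation time. This is a plain-Python illustration (no torch) of the spurious dimension; the `shape` helper is hypothetical and just reports nested-list shapes.

```python
# Illustrative sketch: collating flat per-example IDs yields a 2-D batch,
# while nested ones yield a 3-D batch, which models do not expect.

def shape(x):
    """Return the nested-list shape, assuming rectangular nesting."""
    return (len(x), *shape(x[0])) if isinstance(x, list) else ()

flat_example = [1, 2, 3]       # what the LLM code path returns
nested_example = [[1, 2, 3]]   # what the VLM processor returned pre-fix

batch_flat = [flat_example, flat_example]
batch_nested = [nested_example, nested_example]

print(shape(batch_flat))    # → (2, 3): (batch, seq_len), as expected
print(shape(batch_nested))  # → (2, 1, 3): extra dim triggers shape errors
```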

Adds a regression case to test_train_vlm_text_only_data to include standard_language_modeling in the parameterized dataset configs, ensuring VLM text-only training remains supported.

Written by Cursor Bugbot for commit 7b1acb1.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec
Member

For the record: we do not have the same issue in DPO (text-only data is properly supported), nor in RewardTrainer, which doesn't support VLMs.

@albertvillanova albertvillanova merged commit ee77df9 into huggingface:main Mar 24, 2026
11 of 12 checks passed


Development

Successfully merging this pull request may close these issues.

Qwen3.5 input embeddings have too many values to unpack with SFTTrainer

3 participants