[trainer] fix: handle empty tensors in extract_multi_modal_inputs #4705
Summary
Fixes #4500
This PR fixes a RuntimeError: torch.cat(): expected a non-empty list of Tensors that occurs when processing mixed text and multi-modal batches where some micro-batches contain only text data.
Problem
When training multi-modal models with mixed datasets (some samples have images, some are text-only), text-only samples may have empty lists or empty tensors for multi-modal keys like pixel_values. The extract_multi_modal_inputs function was collecting these empty values and passing them to torch.cat(), causing the crash.
Error trace:
Solution
Added validation to skip empty tensors and lists when collecting multi-modal inputs:
- Skip tensors where numel() == 0
- Skip lists where len() == 0
- Guard the call to torch.cat to handle edge cases
Changes
- verl/utils/model.py: Updated extract_multi_modal_inputs() function
Example
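The filtering described under Solution can be sketched roughly as follows. This is a minimal illustration, not the actual verl code: the helper name concat_non_empty and the assumed input layout (a list of per-sample dicts mapping a multi-modal key to a tensor or a list of tensors) are hypothetical.

```python
import torch


def concat_non_empty(batch_inputs):
    """Hypothetical helper: concatenate per-key values, skipping empty ones.

    Assumes `batch_inputs` is a list of dicts that map a multi-modal key
    (for example "pixel_values") to a tensor or a list of tensors.
    """
    collected = {}
    for inputs in batch_inputs:
        for key, value in inputs.items():
            if isinstance(value, torch.Tensor):
                # Skip empty tensors, e.g. from text-only samples.
                if value.numel() == 0:
                    continue
                collected.setdefault(key, []).append(value)
            elif isinstance(value, (list, tuple)):
                # Skip empty lists, and drop empty tensors inside lists.
                if len(value) == 0:
                    continue
                kept = [
                    v for v in value
                    if isinstance(v, torch.Tensor) and v.numel() > 0
                ]
                collected.setdefault(key, []).extend(kept)
    # Only call torch.cat on keys that still have at least one tensor.
    return {
        key: torch.cat(values, dim=0)
        for key, values in collected.items()
        if values
    }


# Mixed batch: two samples with image features, two text-only samples.
mixed = [
    {"pixel_values": torch.ones(2, 3)},
    {"pixel_values": []},
    {"pixel_values": torch.empty(0, 3)},
    {"pixel_values": torch.zeros(1, 3)},
]
result = concat_non_empty(mixed)

# A micro-batch with only text data yields no multi-modal inputs at all.
text_only = concat_non_empty([{"pixel_values": []}])
```

In this sketch a text-only micro-batch produces an empty dict rather than reaching torch.cat with an empty list, which is the situation that raised the RuntimeError.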
Test plan
🤖 Generated with Claude Code