[trainer] fix: handle empty tensors in extract_multi_modal_inputs #4705
Closed
yurekami wants to merge 1 commit into verl-project:main from
Fixes verl-project#4500

When processing mixed text and multi-modal batches, text-only samples may have empty lists or empty tensors for multi-modal keys (e.g., pixel_values=[]). Previously, these were still collected and passed to torch.cat(), causing RuntimeError: "expected a non-empty list of Tensors".

Changes:
- Skip empty tensors (numel() == 0) when collecting multi-modal inputs
- Skip empty lists/tuples when collecting multi-modal inputs
- Add safety check before torch.cat to skip keys with no valid tensors

This allows mixed text/multi-modal batches to be processed without errors when some micro-batches contain only text data.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
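The failure mode named in the commit message can be reproduced in isolation. The sketch below is not taken from verl; `try_cat` is a hypothetical helper introduced only to show that `torch.cat` rejects an empty list of tensors while accepting a non-empty one. How the empty list arises inside `extract_multi_modal_inputs` is not shown in the source, so this only illustrates the error itself.

```python
import torch


def try_cat(tensors):
    """Hypothetical helper: return (result, error_message) for torch.cat."""
    try:
        return torch.cat(tensors, dim=0), None
    except RuntimeError as exc:
        # torch.cat raises when it is handed no tensors at all.
        return None, str(exc)


# No tensors were collected, e.g. every sample in the micro-batch was text-only.
empty_result, empty_error = try_cat([])

# At least one tensor was collected: concatenation along dim 0 succeeds.
mixed_result, mixed_error = try_cat([torch.zeros(2, 3), torch.zeros(1, 3)])

print("empty list ->", empty_error)
print("non-empty list ->", tuple(mixed_result.shape))
```

The exact wording of the error message may differ between PyTorch versions, so the sketch only relies on a `RuntimeError` being raised.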
Contributor
Code Review
The pull request modifies the extract_multi_modal_inputs function in verl/utils/model.py to enhance input filtering. It now explicitly skips None values, empty torch.Tensor objects, and empty lists/tuples when collecting multi-modal inputs. This change aims to correctly handle mixed batches that may contain text-only samples, preventing empty multi-modal inputs from being processed. Additionally, a new check was added to skip processing any collected multi-modal input key if no valid values were appended to its list.
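The filtering described in this review can be sketched as follows. This is a minimal illustration, not the actual diff: `collect_non_empty` is a hypothetical stand-in for the relevant part of `extract_multi_modal_inputs`, the per-sample dictionaries are an assumed input shape, and the sketch assumes every value that survives the filters is a tensor that can be concatenated along dimension 0.

```python
import torch


def collect_non_empty(samples, keys):
    """Hypothetical helper: gather tensors per key, skipping empty entries."""
    collected = {key: [] for key in keys}
    for sample in samples:
        for key in keys:
            value = sample.get(key)
            if value is None:
                continue  # key absent or explicitly None
            if isinstance(value, torch.Tensor) and value.numel() == 0:
                continue  # empty tensor from a text-only sample
            if isinstance(value, (list, tuple)) and len(value) == 0:
                continue  # empty list/tuple, e.g. pixel_values=[]
            # Assumption: anything left is a non-empty tensor.
            collected[key].append(value)

    merged = {}
    for key, values in collected.items():
        if not values:
            continue  # nothing valid collected; torch.cat would raise here
        merged[key] = torch.cat(values, dim=0)
    return merged


samples = [
    {"pixel_values": torch.ones(2, 4)},
    {"pixel_values": []},                 # text-only sample, empty list
    {"pixel_values": torch.empty(0, 4)},  # text-only sample, empty tensor
    {"pixel_values": torch.ones(1, 4)},
]
merged = collect_non_empty(samples, ["pixel_values", "missing_key"])
text_only = collect_non_empty([{"pixel_values": []}], ["pixel_values"])
```

With these inputs, only the two non-empty tensors are concatenated, and a key with no valid values is simply left out of the result instead of reaching `torch.cat`.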
Summary
Fixes #4500
This PR fixes a `RuntimeError: torch.cat(): expected a non-empty list of Tensors` that occurs when processing mixed text and multi-modal batches where some micro-batches contain only text data.
Problem
When training multi-modal models with mixed datasets (some samples have images, some are text-only), text-only samples may have empty lists or empty tensors for multi-modal keys like `pixel_values`. The `extract_multi_modal_inputs` function was collecting these empty values and passing them to `torch.cat()`, causing the crash.
Error trace:
Solution
Added validation to skip empty tensors and lists when collecting multi-modal inputs:
- Skip empty tensors (`numel() == 0`)
- Skip empty lists/tuples (`len() == 0`)
- Add a safety check before `torch.cat` to handle edge cases
Changes
- `verl/utils/model.py`: Updated `extract_multi_modal_inputs()` function
Example
Test plan
🤖 Generated with Claude Code