[trainer] fix: handle empty tensors in extract_multi_modal_inputs #4705

yurekami · 2025-12-28T16:18:21Z

Summary

This PR fixes a RuntimeError: torch.cat(): expected a non-empty list of Tensors that occurs when processing mixed text and multi-modal batches where some micro-batches contain only text data.

Problem

When training multi-modal models with mixed datasets (some samples have images, some are text-only), text-only samples may have empty lists or empty tensors for multi-modal keys like pixel_values. The extract_multi_modal_inputs function was collecting these empty values and passing them to torch.cat(), causing the crash.

Error trace:

RuntimeError: torch.cat(): expected a non-empty list of Tensors
  at verl/utils/model.py line 748 (extract_multi_modal_inputs)
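
For reference, the failure can be reproduced outside of verl. This is a minimal sketch, assuming only that PyTorch is installed; the variable names `collected` and `error_message` are illustrative and do not come from the verl source. It shows that torch.cat() rejects an empty list, which is the state a key ends up in when no sample contributes a usable tensor.

```python
import torch

# Stand-in for the per-key list built while scanning a micro-batch:
# for a text-only micro-batch, nothing usable is ever appended.
collected = []

error_message = None
try:
    torch.cat(collected, dim=0)
except RuntimeError as exc:
    # The exact wording of the message can differ between PyTorch versions.
    error_message = str(exc)

print(error_message)
```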

Solution

Added validation to skip empty tensors and lists when collecting multi-modal inputs:

- Skip tensors with numel() == 0
- Skip empty lists/tuples with len() == 0
- Add a safety check before torch.cat to skip keys for which no valid tensors were collected
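
The three rules above can be sketched roughly as follows. This is not the actual diff to extract_multi_modal_inputs: the helper names `is_empty_value` and `concat_non_empty` are hypothetical, and the sketch assumes every non-empty value is a tensor that can be concatenated along dim 0.

```python
import torch


def is_empty_value(value):
    """Return True for values that should never reach torch.cat."""
    if value is None:
        return True
    if isinstance(value, torch.Tensor):
        return value.numel() == 0
    if isinstance(value, (list, tuple)):
        return len(value) == 0
    return False


def concat_non_empty(batch_data):
    """Collect non-empty values per key and concatenate them (hypothetical helper)."""
    collected = {}
    for inputs in batch_data:
        if inputs is None:
            continue
        for key, value in inputs.items():
            bucket = collected.setdefault(key, [])
            if is_empty_value(value):
                continue  # text-only sample: nothing to add for this key
            # Assumption: anything that gets here is a tensor.
            bucket.append(value)

    result = {}
    for key, values in collected.items():
        if not values:
            continue  # safety check: no valid tensors were collected for this key
        result[key] = torch.cat(values, dim=0)
    return result


mixed_batch = [
    {"pixel_values": torch.ones(2, 3)},      # multi-modal sample
    {"pixel_values": []},                    # text-only sample, empty list
    {"pixel_values": None},                  # text-only sample, None
    {"pixel_values": torch.empty(0, 3)},     # text-only sample, empty tensor
    {"pixel_values": torch.zeros(1, 3)},     # multi-modal sample
]
mixed_result = concat_non_empty(mixed_batch)
text_only_result = concat_non_empty([{"pixel_values": []}, {"pixel_values": None}])
```

With these inputs, `mixed_result["pixel_values"]` has shape (3, 3), and `text_only_result` is an empty dict because the key is dropped rather than passed to torch.cat.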

Changes

verl/utils/model.py: Updated extract_multi_modal_inputs() function

Example

# Before: This would crash
batch_data = [
    {"pixel_values": torch.tensor([...])},  # Multi-modal sample
    {"pixel_values": []},  # Text-only sample - empty list
    {"pixel_values": None},  # Text-only sample - None
]
result = extract_multi_modal_inputs(batch_data)  # RuntimeError!

# After: Works correctly
result = extract_multi_modal_inputs(batch_data)  # Returns concatenated tensors

Test plan

Syntax validation passes
Unit test with mixed text/multi-modal batch
Integration test with SFT training on mixed dataset

🤖 Generated with Claude Code

Fixes volcengine#4500

When processing mixed text and multi-modal batches, text-only samples may have empty lists or empty tensors for multi-modal keys (e.g., pixel_values=[]). Previously, these were still collected and passed to torch.cat(), causing RuntimeError: "expected a non-empty list of Tensors".

Changes:
- Skip empty tensors (numel() == 0) when collecting multi-modal inputs
- Skip empty lists/tuples when collecting multi-modal inputs
- Add safety check before torch.cat to skip keys with no valid tensors

This allows mixed text/multi-modal batches to be processed without errors when some micro-batches contain only text data.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

gemini-code-assist

Code Review

The pull request modifies the extract_multi_modal_inputs function in verl/utils/model.py to enhance input filtering. It now explicitly skips None values, empty torch.Tensor objects, and empty lists/tuples when collecting multi-modal inputs. This change aims to correctly handle mixed batches that may contain text-only samples, preventing empty multi-modal inputs from being processed. Additionally, a new check was added to skip processing any collected multi-modal input key if no valid values were appended to its list.

gemini-code-assist bot reviewed Dec 28, 2025

wuxibin89 changed the title ~~fix(utils): handle empty tensors in extract_multi_modal_inputs~~ [trainer] fix: handle empty tensors in extract_multi_modal_inputs Dec 29, 2025

wuxibin89 closed this Dec 30, 2025
