@yurekami
Contributor

Summary

Fixes #4500

This PR fixes the error "RuntimeError: torch.cat(): expected a non-empty list of Tensors", which is raised when processing mixed text and multi-modal batches in which some micro-batches contain only text data.

Problem

When training multi-modal models with mixed datasets (some samples have images, some are text-only), text-only samples may have empty lists or empty tensors for multi-modal keys like pixel_values. The extract_multi_modal_inputs function was collecting these empty values and passing them to torch.cat(), causing the crash.

Error trace:

RuntimeError: torch.cat(): expected a non-empty list of Tensors
  at verl/utils/model.py line 748 (extract_multi_modal_inputs)
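
The failure mode can be reproduced in isolation. The sketch below assumes only that PyTorch is installed; the helper name cat_or_error is hypothetical and is not part of verl. It shows that torch.cat raises as soon as the list it receives contains nothing to concatenate. How such a list arises inside extract_multi_modal_inputs is not reproduced here.

```python
import torch


def cat_or_error(tensors):
    """Hypothetical helper: return the concatenation, or the error text on failure."""
    try:
        return torch.cat(tensors, dim=0)
    except RuntimeError as exc:
        return str(exc)


# Two non-empty tensors concatenate along dim 0 as expected.
ok = cat_or_error([torch.zeros(2, 3), torch.zeros(1, 3)])

# An empty list gives torch.cat nothing to concatenate, so it raises.
err = cat_or_error([])
print(type(ok).__name__, tuple(ok.shape))
print(err)
```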

Solution

Added validation to skip empty tensors and lists when collecting multi-modal inputs:

  1. Skip tensors with numel() == 0
  2. Skip empty lists/tuples with len() == 0
  3. Before calling torch.cat, skip any key for which no valid tensors were collected
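
The three steps above can be sketched as follows. This is a simplified stand-in written for illustration, not the code in verl/utils/model.py: the names is_empty_value and collect_and_cat are hypothetical, and the sketch assumes every non-empty value is a tensor that can be concatenated along dim 0.

```python
import torch


def is_empty_value(value):
    """Return True for values that should not be handed to torch.cat."""
    if value is None:
        return True
    if isinstance(value, torch.Tensor):
        return value.numel() == 0  # step 1: empty tensor
    if isinstance(value, (list, tuple)):
        return len(value) == 0  # step 2: empty list or tuple
    return False


def collect_and_cat(batch_data):
    """Hypothetical stand-in for the collection step (not verl's actual code)."""
    collected = {}
    for sample in batch_data:
        if sample is None:  # assumption: a text-only sample may be None
            continue
        for key, value in sample.items():
            values = collected.setdefault(key, [])  # register the key first
            if is_empty_value(value):
                continue
            values.append(value)

    result = {}
    for key, values in collected.items():
        if not values:  # step 3: no valid tensors for this key, skip torch.cat
            continue
        result[key] = torch.cat(values, dim=0)
    return result


batch = [
    {"pixel_values": torch.ones(2, 4)},   # multi-modal sample
    {"pixel_values": []},                 # text-only sample, empty list
    {"pixel_values": None},               # text-only sample, None
    {"pixel_values": torch.empty(0, 4)},  # text-only sample, empty tensor
    {"pixel_values": torch.ones(1, 4)},   # multi-modal sample
]
merged = collect_and_cat(batch)
text_only = collect_and_cat([{"pixel_values": []}, {"pixel_values": None}])
```

In this sketch an all-text micro-batch yields an empty dict rather than an error, so the caller has to tolerate missing multi-modal keys.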

Changes

  • verl/utils/model.py: Updated extract_multi_modal_inputs() function

Example

# Before: This would crash
batch_data = [
    {"pixel_values": torch.tensor([...])},  # Multi-modal sample
    {"pixel_values": []},  # Text-only sample - empty list
    {"pixel_values": None},  # Text-only sample - None
]
result = extract_multi_modal_inputs(batch_data)  # RuntimeError!

# After: Works correctly
result = extract_multi_modal_inputs(batch_data)  # Returns concatenated tensors

Test plan

  • Syntax validation passes
  • Unit test with mixed text/multi-modal batch
  • Integration test with SFT training on mixed dataset

🤖 Generated with Claude Code

Fixes volcengine#4500

When processing mixed text and multi-modal batches, text-only samples
may have empty lists or empty tensors for multi-modal keys (e.g.,
pixel_values=[]). Previously, these were still collected and passed
to torch.cat(), causing RuntimeError: "expected a non-empty list of
Tensors".

Changes:
- Skip empty tensors (numel() == 0) when collecting multi-modal inputs
- Skip empty lists/tuples when collecting multi-modal inputs
- Add safety check before torch.cat to skip keys with no valid tensors

This allows mixed text/multi-modal batches to be processed without
errors when some micro-batches contain only text data.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Contributor

@gemini-code-assist (bot) left a comment

Code Review

The pull request modifies the extract_multi_modal_inputs function in verl/utils/model.py to enhance input filtering. It now explicitly skips None values, empty torch.Tensor objects, and empty lists/tuples when collecting multi-modal inputs. This change aims to correctly handle mixed batches that may contain text-only samples, preventing empty multi-modal inputs from being processed. Additionally, a new check was added to skip processing any collected multi-modal input key if no valid values were appended to its list.

@wuxibin89 changed the title from "fix(utils): handle empty tensors in extract_multi_modal_inputs" to "[trainer] fix: handle empty tensors in extract_multi_modal_inputs" on Dec 29, 2025
@wuxibin89 closed this on Dec 30, 2025

Development

Successfully merging this pull request may close these issues.

RuntimeError: torch.cat(): expected a non-empty list of Tensors