Contributor

@hallerite hallerite commented Feb 1, 2026

Add experimental support for multi-turn RL with VLMs.


Note

Medium Risk
Touches the multimodal data path that builds pixel_values/image_grid_thw for training samples; mistakes can silently misalign images with tokens across turns and degrade RL training stability.

Overview
Adds multi-turn multimodal (VLM) rollout support by extending the VLM image cache to extract and preprocess images across all trajectory steps, then supplying per-step cumulative pixel_values/image_grid_thw to branch_rollout (while interleave_rollout uses the final step’s images).
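To make the difference between the two rollout modes concrete, here is a minimal sketch of the cumulative image assignment described above. It is not the PR's code: the function names and the `images_per_step` input are hypothetical, and plain lists stand in for the preprocessed `pixel_values`/`image_grid_thw` entries.

```python
from itertools import accumulate


def images_for_branch(all_images, images_per_step):
    """Return one cumulative image list per trajectory step.

    Hypothetical helper: step i gets every image that appeared in steps 0..i,
    mirroring the "per-step cumulative" behaviour described for branch_rollout.
    """
    cumulative_counts = accumulate(images_per_step)
    return [all_images[:count] for count in cumulative_counts]


def images_for_interleave(all_images, images_per_step):
    """Return the image list of the final step only.

    Hypothetical helper: a single interleaved sample covers the whole
    trajectory, so it uses the images accumulated by the last step.
    """
    per_step = images_for_branch(all_images, images_per_step)
    return per_step[-1] if per_step else []


# Assumed example: three steps that introduce 1, 0 and 2 new images.
all_images = ["img_a", "img_b", "img_c"]
images_per_step = [1, 0, 2]

branch = images_for_branch(all_images, images_per_step)
interleave = images_for_interleave(all_images, images_per_step)
```

In this sketch a step that adds no new image (the second one) still carries the images from earlier turns, which is the alignment property the risk note is concerned with.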

Updates the orchestrator rollout-processing path to pass the shared vlm_cache object into rollout conversion, selects the longest trajectory per example_id when building the cache, and expands unit tests to cover multi-turn image extraction, cache accessors, and rollout image assignment; docs now note higher KL mismatch for multi-image inputs.
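The "longest trajectory per example_id" selection can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the rollout dict layout (`example_id`, `trajectory`) and the function name are hypothetical.

```python
def longest_trajectory_per_example(rollouts):
    """Keep, for each example_id, the rollout with the most trajectory steps.

    Hypothetical helper: each rollout is assumed to be a dict with an
    "example_id" key and a "trajectory" list of steps. On ties, the first
    rollout seen is kept.
    """
    longest = {}
    for rollout in rollouts:
        key = rollout["example_id"]
        current = longest.get(key)
        if current is None or len(rollout["trajectory"]) > len(current["trajectory"]):
            longest[key] = rollout
    return longest


# Assumed example: two rollouts for example 0, one for example 1.
rollouts = [
    {"example_id": 0, "trajectory": ["step0"]},
    {"example_id": 0, "trajectory": ["step0", "step1", "step2"]},
    {"example_id": 1, "trajectory": ["step0", "step1"]},
]
selected = longest_trajectory_per_example(rollouts)
```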

Written by Cursor Bugbot for commit fb42176. This will update automatically on new commits.

@hallerite force-pushed the hallerite/multimodal branch from b47a19f to 2984d63 on February 1, 2026 04:43
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
