add multi-turn support for multi-modal RL #1703
Open
+514
−73
Add experimental support for multi-turn RL with VLMs.
Note
Medium Risk
Touches the multimodal data path that builds `pixel_values`/`image_grid_thw` for training samples; mistakes can silently misalign images with tokens across turns and degrade RL training stability.

Overview
Adds multi-turn multimodal (VLM) rollout support by extending the VLM image cache to extract and preprocess images across all trajectory steps, then supplying per-step cumulative `pixel_values`/`image_grid_thw` to `branch_rollout` (while `interleave_rollout` uses the final step's images).
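A minimal sketch of the per-step cumulative image handling described above, assuming a hypothetical `preprocess` callable that returns `(pixel_values, image_grid_thw)` tensors for one image; the step shape and field names are illustrative, not this PR's actual API:

```python
from dataclasses import dataclass

import torch


@dataclass
class StepImages:
    # Preprocessed tensors for all images visible up to and including a step.
    pixel_values: torch.Tensor    # concatenated along the patch dimension
    image_grid_thw: torch.Tensor  # one (t, h, w) row per image


def cumulative_step_images(steps, preprocess):
    """Accumulate images turn by turn so each step's entry covers every
    image from all earlier turns (hypothetical helper, not the PR's code)."""
    per_step = []
    pv_parts, grid_parts = [], []
    for step in steps:
        for image in step.get("images", []):
            pv, grid = preprocess(image)  # assumed to return both tensors
            pv_parts.append(pv)
            grid_parts.append(grid)
        if pv_parts:
            per_step.append(StepImages(
                pixel_values=torch.cat(pv_parts, dim=0),
                image_grid_thw=torch.cat(grid_parts, dim=0),
            ))
        else:
            per_step.append(None)  # no images seen yet in this trajectory
    return per_step
```

Under these assumptions, a branch at step `i` would consume `per_step[i]`, while the interleaved path only needs the final cumulative entry, `per_step[-1]`.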
Updates the orchestrator rollout-processing path to pass the shared `vlm_cache` object into rollout conversion, selects the longest trajectory per `example_id` when building the cache, and expands unit tests to cover multi-turn image extraction, cache accessors, and rollout image assignment; docs now note higher KL mismatch for multi-image inputs.

Written by Cursor Bugbot for commit fb42176.
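An illustrative sketch of the longest-trajectory selection when building the cache, presumably so the cached image set covers every shorter branch of the same example; the trajectory dict layout here is an assumption:

```python
def longest_trajectory_per_example(trajectories):
    """For each example_id, keep the trajectory with the most steps
    (hypothetical helper mirroring the cache-building rule described above)."""
    best = {}
    for traj in trajectories:
        key = traj["example_id"]
        if key not in best or len(traj["steps"]) > len(best[key]["steps"]):
            best[key] = traj
    return best
```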