-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Support of trajectory aggregation for mrope multimodal model, and add multimodal prefix checks for trajectory merge #469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
… multimodal prefix checks for trajectory merge
|
@microsoft-github-policy-service agree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR enables trajectory aggregation for multimodal models using M-RoPE (Multi-Resolution Rope) position embeddings, which was previously blocked by an assertion. The change adds multimodal state tracking and validation to ensure correctness when merging multi-turn trajectories that contain image inputs.
Changes:
- Removed hard assertion blocking trajectory aggregation for M-RoPE models
- Added image URL prefix matching during trajectory merge to ensure consistency across turns
- Implemented image grid metadata propagation from merged trajectories for correct M-RoPE position computation
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try: | ||
| with open(path, "rb") as handle: | ||
| data = handle.read() | ||
| import hashlib |
Copilot
AI
Jan 29, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The import statement for hashlib should be placed at the top of the file with other imports, rather than being imported locally within the function. This is inconsistent with Python best practices and the codebase's import patterns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot open a new pull request to apply changes based on this feedback
| import base64 | ||
| import hashlib |
Copilot
AI
Jan 29, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The import statements for base64 and hashlib should be placed at the top of the file with other imports, rather than being imported locally within the function. This is inconsistent with Python best practices and the codebase's import patterns.
| image_prefix_ok = image_urls_startswith(trace.get("image_urls", []), current_image_urls) | ||
| if not image_prefix_ok: | ||
| image_mismatch_count += 1 | ||
| if self.trace_aggregator.get("debug", False) == True: |
Copilot
AI
Jan 29, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition self.trace_aggregator.get("debug", False) == True is redundant. Since the get method returns a boolean, the explicit comparison to True is unnecessary and less Pythonic. The condition should be simplified to self.trace_aggregator.get("debug", False).
| turn_index, | ||
| self.trace_aggregator.get("mismatch_log_dir", None), | ||
| ) | ||
| if not token_prefix_ok and self.trace_aggregator.get("debug", False) == True: |
Copilot
AI
Jan 29, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition self.trace_aggregator.get("debug", False) == True is redundant. Since the get method returns a boolean, the explicit comparison to True is unnecessary and less Pythonic. The condition should be simplified to self.trace_aggregator.get("debug", False). This is consistent with the same pattern that should be fixed on line 998.
Summary
This PR enables trajectory aggregation for multimodal models using mRoPE, which was previously disallowed due to missing multimodal state and correctness concerns during aggregation.
Motivation
Trajectory aggregation is an important feature for accelerating training in multi-turn agent reinforcement learning. Compared to transition-level aggregation, it processes an entire interaction trajectory as a single training sample, eliminating redundant computation on repeated history prefixes and significantly improving training throughput.
This aggregation strategy is commonly used in large language model pre-training and fine-tuning, and is described in more detail in the Agent Lightning documentation: https://agent-lightning.github.io/posts/trajectory_level_aggregation/
In the current implementation, trajectory aggregation is explicitly disabled for multimodal models using mRoPE via an assertion. This restriction was introduced because the aggregated trajectories did not carry sufficient multimodal state to guarantee correctness, which could lead to incorrect mRoPE position assignment when image inputs are involved.
This PR revisits that restriction by making the required multimodal information explicit and introducing additional validation to ensure correctness.
Changes Made
image_grid_thwduring trajectory merges by deriving the multimodal grid metadata from the last merged trace and appending it toimage_grid_thw_list, ensuring consistentposition_idscomputation and correct image token alignment.file://anddata:URLs with identical content map to the same hash), and mismatches are logged for debugging.Testing
Not run with official examples, as the current repository does not yet include a multimodal, multi-turn trajectory example that exercises trajectory aggregation.
Existing examples fall into the following categories:
The changes were verified locally using a custom multimodal, multi-turn workflow (X-ray image input with a crop tool), where conversation history is preserved across turns. Trajectory merges succeeded only when both token prefixes and image prefixes matched, and the new validation logic behaved as expected.
A standalone multimodal trajectory example (X-ray + crop tool) can be contributed in a follow-up PR if desired.
Breaking Changes / Risks