@jackhu-bme
Summary

This PR enables trajectory aggregation for multimodal models using mRoPE, which was previously disallowed due to missing multimodal state and correctness concerns during aggregation.

Motivation

Trajectory aggregation is an important feature for accelerating training in multi-turn agent reinforcement learning. Compared to transition-level aggregation, it processes an entire interaction trajectory as a single training sample, eliminating redundant computation on repeated history prefixes and significantly improving training throughput.
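
To make the saving concrete, here is a minimal sketch that counts the tokens processed under each strategy for a hypothetical three-turn trajectory. The turn lengths, the `Turn` layout, and both function names are illustrative assumptions and do not come from the repository.

```python
from typing import List, Tuple

# Hypothetical layout: (new_context_tokens, response_tokens) for one turn.
Turn = Tuple[int, int]


def transition_level_tokens(turns: List[Turn]) -> int:
    """Tokens processed when each turn is its own training sample.

    Every sample re-encodes the full history that preceded it.
    """
    total = 0
    history = 0
    for new_context, response in turns:
        history += new_context       # new observation or user tokens
        total += history + response  # sample = full history + this response
        history += response          # the response becomes history as well
    return total


def trajectory_level_tokens(turns: List[Turn]) -> int:
    """Tokens processed when the whole trajectory is one sample."""
    return sum(new_context + response for new_context, response in turns)


turns = [(100, 20), (30, 20), (30, 20)]
print(transition_level_tokens(turns))  # 510
print(trajectory_level_tokens(turns))  # 220
```

In this toy case the shared prefixes account for more than half of the transition-level work, and the gap widens as the number of turns grows.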

This aggregation strategy is commonly used in large language model pre-training and fine-tuning, and is described in more detail in the Agent Lightning documentation: https://agent-lightning.github.io/posts/trajectory_level_aggregation/

In the current implementation, trajectory aggregation is explicitly disabled for multimodal models using mRoPE via an assertion. This restriction was introduced because the aggregated trajectories did not carry sufficient multimodal state to guarantee correctness, which could lead to incorrect mRoPE position assignment when image inputs are involved.

This PR revisits that restriction by making the required multimodal information explicit and introducing additional validation to ensure correctness.

Changes Made

  • Removed the hard assertion that disabled trajectory aggregation for mRoPE-based multimodal models, allowing trajectories to proceed when image grid metadata is available.
  • Propagated image_grid_thw during trajectory merges by deriving the multimodal grid metadata from the last merged trace and appending it to image_grid_thw_list, ensuring consistent position_ids computation and correct image token alignment.
  • Added multimodal prefix consistency checks during trajectory aggregation: trajectory merges now require both token-level prefixes and image URL prefixes to match. Image URLs are semantically normalized (e.g., file:// and data: URLs with identical content map to the same hash), and mismatches are logged for debugging.
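
The image prefix check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `image_content_key` is a hypothetical helper, the choice of SHA-256 and the URL-string fallback are assumptions, and only the name `image_urls_startswith` and its argument order are taken from the diff shown in the review.

```python
import base64
import hashlib
from typing import List, Optional
from urllib.parse import unquote, urlparse


def image_content_key(url: str) -> str:
    """Map an image URL to a key that depends on content where possible.

    Hypothetical helper: file:// and base64 data: URLs are hashed by their
    bytes, so identical images compare equal regardless of how they are
    referenced. Anything else falls back to the URL string itself.
    """
    data: Optional[bytes] = None
    if url.startswith("data:"):
        header, _, payload = url.partition(",")
        if header.endswith(";base64"):
            try:
                data = base64.b64decode(payload)
            except ValueError:  # binascii.Error is a ValueError subclass
                data = None
    elif url.startswith("file://"):
        path = unquote(urlparse(url).path)
        try:
            with open(path, "rb") as handle:
                data = handle.read()
        except OSError:
            data = None
    if data is None:
        return "url:" + url
    return "sha256:" + hashlib.sha256(data).hexdigest()


def image_urls_startswith(full: List[str], prefix: List[str]) -> bool:
    """Return True if `prefix` is a content-level prefix of `full`."""
    if len(prefix) > len(full):
        return False
    # zip stops at the shorter list, which is `prefix` at this point.
    return all(
        image_content_key(a) == image_content_key(b)
        for a, b in zip(full, prefix)
    )
```

Hashing content rather than comparing URL strings means a turn that references an image by path and a later turn that inlines the same bytes are still treated as sharing a prefix.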

Testing

These changes were not run against the official examples, because the repository does not yet include a multimodal, multi-turn example that exercises trajectory aggregation.

Existing examples fall into the following categories:

  • Pure-text, multi-turn agents (e.g., conversational or planning demos), which are suitable for trajectory aggregation but do not involve multimodal inputs.
  • Multimodal examples that are single-turn or stage-based pipelines, where images are provided per step but conversation history is not fed back into subsequent turns, so trajectory-level prefix merging is not exercised.
  • Single-shot or non-conversational tasks that do not involve trajectory aggregation.

The changes were verified locally using a custom multimodal, multi-turn workflow (X-ray image input with a crop tool), where conversation history is preserved across turns. Trajectory merges succeeded only when both token prefixes and image prefixes matched, and the new validation logic behaved as expected.
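
The merge behavior observed in that workflow can be illustrated with a simplified sketch. The record layout (`token_ids`, `image_keys`, `image_grid_thw`) and the function `merge_turns` are assumptions made for illustration, and `image_keys` stands in for already-normalized image identifiers. The real aggregator is not shown in this PR page and may differ.

```python
from typing import Any, Dict, List


def merge_turns(turns: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Greedily merge consecutive turns that extend the current sample.

    A turn is merged only if BOTH its token ids and its image keys start
    with those of the sample built so far. The merged sample is taken from
    the latest turn, so its image_grid_thw covers every image in the history.
    """
    merged: List[Dict[str, Any]] = []
    for turn in turns:
        if merged:
            current = merged[-1]
            n_tokens = len(current["token_ids"])
            n_images = len(current["image_keys"])
            token_ok = turn["token_ids"][:n_tokens] == current["token_ids"]
            image_ok = turn["image_keys"][:n_images] == current["image_keys"]
            if token_ok and image_ok:
                merged[-1] = dict(turn)  # later turn supersedes its prefix
                continue
        # First turn, or a prefix mismatch: start a new sample.
        merged.append(dict(turn))
    return merged


turns = [
    {"token_ids": [1, 2, 3], "image_keys": ["a"], "image_grid_thw": [[1, 4, 4]]},
    {"token_ids": [1, 2, 3, 4, 5], "image_keys": ["a", "b"],
     "image_grid_thw": [[1, 4, 4], [1, 2, 2]]},
    # Tokens extend the previous turn, but the second image differs.
    {"token_ids": [1, 2, 3, 4, 5, 6], "image_keys": ["a", "c"],
     "image_grid_thw": [[1, 4, 4], [1, 8, 8]]},
]
samples = merge_turns(turns)
print(len(samples))  # 2
```

The third turn is kept as a separate sample even though its tokens extend the second, which is the "token prefixes match but image prefixes do not" case.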

A standalone multimodal trajectory example (X-ray + crop tool) can be contributed in a follow-up PR if desired.

Breaking Changes / Risks

  • No breaking changes to public APIs.
  • Trajectory aggregation for multimodal mRoPE models is now enabled under stricter validation. In rare cases where token prefixes match but image prefixes do not, aggregation will be skipped to preserve correctness.

… multimodal prefix checks for trajectory merge
Copilot AI review requested due to automatic review settings January 29, 2026 08:02
jackhu-bme commented Jan 29, 2026

@microsoft-github-policy-service agree

Copilot AI left a comment

Pull request overview

This PR enables trajectory aggregation for multimodal models using M-RoPE (Multimodal Rotary Position Embedding), which was previously blocked by an assertion. The change adds multimodal state tracking and validation to ensure correctness when merging multi-turn trajectories that contain image inputs.

Changes:

  • Removed hard assertion blocking trajectory aggregation for M-RoPE models
  • Added image URL prefix matching during trajectory merge to ensure consistency across turns
  • Implemented image grid metadata propagation from merged trajectories for correct M-RoPE position computation

try:
    with open(path, "rb") as handle:
        data = handle.read()
    import hashlib

Copilot AI Jan 29, 2026

The import statement for hashlib should be placed at the top of the file with other imports, rather than being imported locally within the function. This is inconsistent with Python best practices and the codebase's import patterns.

Author

@copilot open a new pull request to apply changes based on this feedback

Comment on lines +87 to +88
import base64
import hashlib

Copilot AI Jan 29, 2026

The import statements for base64 and hashlib should be placed at the top of the file with other imports, rather than being imported locally within the function. This is inconsistent with Python best practices and the codebase's import patterns.

image_prefix_ok = image_urls_startswith(trace.get("image_urls", []), current_image_urls)
if not image_prefix_ok:
    image_mismatch_count += 1
    if self.trace_aggregator.get("debug", False) == True:

Copilot AI Jan 29, 2026

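The comparison self.trace_aggregator.get("debug", False) == True is redundant when the debug setting is a boolean: the explicit comparison to True is unnecessary and less Pythonic, so the condition can be simplified to self.trace_aggregator.get("debug", False). Note that get returns whatever value is stored, so the two forms differ for non-boolean values (for example, a non-empty string is truthy but does not equal True).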
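
A small sketch of how the two forms behave may help when deciding whether the simplification is safe. The config dictionaries and both helper names are hypothetical. The only assumption about the real code is that trace_aggregator behaves like a dict.

```python
from typing import Any, Mapping


def debug_enabled_strict(config: Mapping[str, Any]) -> bool:
    """Mirrors the `== True` form used in the diff."""
    return config.get("debug", False) == True  # noqa: E712


def debug_enabled_truthy(config: Mapping[str, Any]) -> bool:
    """Mirrors the simplified form, which relies on truthiness."""
    return bool(config.get("debug", False))


for config in ({}, {"debug": True}, {"debug": 1}, {"debug": "true"}):
    print(config, debug_enabled_strict(config), debug_enabled_truthy(config))
```

The forms agree for a missing key, for True, and for 1, and disagree only for the string "true", which is truthy but not equal to True. If the setting can arrive as a string (for example from a YAML or CLI override), the simplification changes behavior and the value should be parsed explicitly instead.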

    turn_index,
    self.trace_aggregator.get("mismatch_log_dir", None),
)
if not token_prefix_ok and self.trace_aggregator.get("debug", False) == True:

Copilot AI Jan 29, 2026

The comparison self.trace_aggregator.get("debug", False) == True is redundant when the debug setting is a boolean: the explicit comparison to True is unnecessary and less Pythonic, so the condition can be simplified to self.trace_aggregator.get("debug", False). This matches the same pattern that should be fixed on line 998.