[model] refactor: Compatible with Transformers v5 RoPE #348
Conversation
Code Review
This pull request refactors several models to be compatible with the RoPE implementation in transformers v5. The changes mostly involve updating the RotaryEmbedding classes to use the new configuration structure and utility functions. While most models are updated correctly, there are some critical issues with models that use custom 3D position_ids for RoPE.
Specifically:
- For the `qwen2_5_vl` and `qwen2_vl` models, the dynamic RoPE scaling functionality has been removed.
- For the `qwen3_vl` and `qwen3_vl_moe` models, the new `@dynamic_rope_update` decorator from `transformers` is used, but it is incompatible with the 3D `position_ids` of these models, which will likely lead to runtime errors.
These issues should be addressed to ensure correct functionality and avoid regressions.
        return inv_freq, attention_factor

    @torch.no_grad()
    @dynamic_rope_update  # power user: used with advanced RoPE types (e.g. dynamic rope)
The @dynamic_rope_update decorator from transformers.modeling_rope_utils is not compatible with the 3D position_ids tensor of shape (3, batch_size, sequence_length) used in this model's forward method. The decorator's internal logic expects a 2D position_ids tensor to calculate the sequence length, and will likely fail or produce incorrect results with a 3D tensor. This could lead to runtime errors or incorrect RoPE scaling when dynamic scaling is triggered.
A custom implementation for dynamic frequency updates that can handle 3D position_ids is required if this feature is to be supported.
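For illustration only, a 3D-aware update could mirror the removed v4 helper and derive the effective sequence length with a global max over `position_ids`. This is a minimal sketch; the method name `_dynamic_frequency_update_3d` and the `rope_init_fn`, `max_seq_len_cached`, and `original_max_seq_len` attributes are assumptions about the surrounding class, not an existing transformers API:

```python
import torch


@torch.no_grad()
def _dynamic_frequency_update_3d(self, position_ids: torch.Tensor, device: torch.device) -> None:
    # Works for both 2D (batch, seq) and 3D (3, batch, seq) position_ids because the
    # maximum is taken over the whole tensor rather than over a fixed dimension.
    seq_len = int(position_ids.max()) + 1

    if seq_len > self.max_seq_len_cached:  # growth: recompute inv_freq for the longer sequence
        inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device, seq_len=seq_len)
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.max_seq_len_cached = seq_len

    if seq_len < self.original_max_seq_len and self.max_seq_len_cached > self.original_max_seq_len:
        # reset: restore the original frequencies to avoid precision loss on short sequences
        self.register_buffer("inv_freq", self.original_inv_freq.to(device), persistent=False)
        self.max_seq_len_cached = self.original_max_seq_len
```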
        return inv_freq, attention_factor

    @torch.no_grad()
    @dynamic_rope_update  # power user: used with advanced RoPE types (e.g. dynamic rope)
The @dynamic_rope_update decorator is incompatible with the 3D position_ids tensor used in Qwen3VLMoeTextRotaryEmbedding. The transformers utility expects a 2D tensor to correctly infer sequence length for dynamic scaling. Using it with a 3D tensor will likely cause runtime errors or incorrect scaling behavior. If dynamic RoPE scaling is needed, a custom update mechanism that correctly handles the 3D position_ids is necessary.
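If the stock decorator were kept, one conceivable workaround (purely a sketch; `flatten_mrope_position_ids` is a hypothetical helper, not part of this PR or of transformers) is to collapse the 3D multimodal `position_ids` to a 2D tensor before any sequence-length bookkeeping, since only the maximum position matters for dynamic scaling:

```python
import torch


def flatten_mrope_position_ids(position_ids: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: reduce (3, batch, seq) M-RoPE position ids to (batch, seq)."""
    if position_ids.dim() == 3:
        # Per-token maximum across the three rotary sections (temporal / height / width).
        return position_ids.max(dim=0).values
    return position_ids


# A (3, 2, 5) tensor collapses to (2, 5) while keeping the largest position per token.
pos = torch.arange(30).reshape(3, 2, 5)
print(flatten_mrope_position_ids(pos).shape)  # torch.Size([2, 5])
```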
Before (transformers v4 style):

    def __init__(self, config: Qwen2_5_VLConfig, device=None):
        super().__init__()
        # BC: "rope_type" was originally "type"
        if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
            self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
        else:
            self.rope_type = "default"
        self.max_seq_len_cached = config.max_position_embeddings
        self.original_max_seq_len = config.max_position_embeddings

        self.config = config
        self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]

        inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.original_inv_freq = self.inv_freq

    def _dynamic_frequency_update(self, position_ids, device):
        """
        dynamic RoPE layers should recompute `inv_freq` in the following situations:
        1 - growing beyond the cached sequence length (allow scaling)
        2 - the current sequence length is in the original scale (avoid losing precision with small sequences)
        """
        seq_len = torch.max(position_ids) + 1
        if seq_len > self.max_seq_len_cached:  # growth
            inv_freq, self.attention_scaling = self.rope_init_fn(
                self.config, device, seq_len=seq_len, **self.rope_kwargs
            )
            self.register_buffer("inv_freq", inv_freq, persistent=False)  # TODO joao: may break with compilation
            self.max_seq_len_cached = seq_len

        if seq_len < self.original_max_seq_len and self.max_seq_len_cached > self.original_max_seq_len:  # reset
            self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)
            self.max_seq_len_cached = self.original_max_seq_len

    @torch.no_grad()
    def forward(self, x, position_ids):
        if "dynamic" in self.rope_type:
            self._dynamic_frequency_update(position_ids, device=x.device)

After (transformers v5 style):

    def __init__(self, config: Qwen2_5_VLConfig, device=None):
        super().__init__()
        self.config = config
        self.rope_type = self.config.rope_parameters["rope_type"]
        rope_init_fn: Callable = self.compute_default_rope_parameters
        if self.rope_type != "default":
            rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
        inv_freq, self.attention_scaling = rope_init_fn(self.config, device)
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.original_inv_freq = inv_freq

    @staticmethod
    def compute_default_rope_parameters(
        config: Optional[Qwen2_5_VLConfig] = None,
        device: Optional["torch.device"] = None,
        seq_len: Optional[int] = None,
    ) -> tuple["torch.Tensor", float]:
        """
        Computes the inverse frequencies according to the original RoPE implementation
        Args:
            config ([`~transformers.PreTrainedConfig`]):
                The model configuration.
            device (`torch.device`):
                The device to use for initialization of the inverse frequencies.
            seq_len (`int`, *optional*):
                The current sequence length. Unused for this type of RoPE.
        Returns:
            Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
            post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
        """
        base = config.rope_parameters["rope_theta"]
        dim = getattr(config, "head_dim", None) or config.hidden_size // config.num_attention_heads

        attention_factor = 1.0  # Unused in this type of RoPE

        # Compute the inverse frequencies
        inv_freq = 1.0 / (
            base ** (torch.arange(0, dim, 2, dtype=torch.int64).to(device=device, dtype=torch.float) / dim)
        )
        return inv_freq, attention_factor
The refactoring to align with transformers v5 has removed the _dynamic_frequency_update method and its invocation from the forward method. This effectively disables dynamic RoPE scaling for this model, which was previously supported. This is a significant change in functionality and might be unintentional. If dynamic scaling is still desired, a custom implementation compatible with the 3D position_ids used in this model should be provided, as the standard @dynamic_rope_update decorator from transformers is not suitable for this case.
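For reference, the removed behaviour could be restored on top of the refactored class roughly as follows. This is a sketch of the v5-style forward for this rotary embedding, and `_dynamic_frequency_update_3d` is the hypothetical helper sketched in an earlier comment, not an existing transformers utility:

```python
import torch


@torch.no_grad()
def forward(self, x, position_ids):
    if "dynamic" in self.rope_type:
        # Hypothetical: refresh inv_freq when the 3D position_ids grow past the cached length.
        self._dynamic_frequency_update_3d(position_ids, device=x.device)

    # position_ids has shape (3, batch, seq); expand inv_freq to (3, batch, dim/2, 1).
    inv_freq_expanded = self.inv_freq[None, None, :, None].float().expand(3, position_ids.shape[1], -1, 1)
    position_ids_expanded = position_ids[:, :, None, :].float()  # (3, batch, 1, seq)
    freqs = (inv_freq_expanded @ position_ids_expanded).transpose(2, 3)  # (3, batch, seq, dim/2)
    emb = torch.cat((freqs, freqs), dim=-1)
    cos = emb.cos() * self.attention_scaling
    sin = emb.sin() * self.attention_scaling
    return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
```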
Before (transformers v4 style):

    def __init__(self, config: Qwen2VLConfig, device=None):
        super().__init__()
        # BC: "rope_type" was originally "type"
        if hasattr(config, "rope_scaling") and config.rope_scaling is not None:
            self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
        else:
            self.rope_type = "default"
        self.max_seq_len_cached = config.max_position_embeddings
        self.original_max_seq_len = config.max_position_embeddings

        self.config = config
        self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]

        inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.original_inv_freq = self.inv_freq

    def _dynamic_frequency_update(self, position_ids, device):
        """
        dynamic RoPE layers should recompute `inv_freq` in the following situations:
        1 - growing beyond the cached sequence length (allow scaling)
        2 - the current sequence length is in the original scale (avoid losing precision with small sequences)
        """
        seq_len = torch.max(position_ids) + 1
        if seq_len > self.max_seq_len_cached:  # growth
            inv_freq, self.attention_scaling = self.rope_init_fn(
                self.config, device, seq_len=seq_len, **self.rope_kwargs
            )
            self.register_buffer("inv_freq", inv_freq, persistent=False)  # TODO joao: may break with compilation
            self.max_seq_len_cached = seq_len

        if seq_len < self.original_max_seq_len and self.max_seq_len_cached > self.original_max_seq_len:  # reset
            self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)
            self.max_seq_len_cached = self.original_max_seq_len

    @torch.no_grad()
    def forward(self, x, position_ids):
        if "dynamic" in self.rope_type:
            self._dynamic_frequency_update(position_ids, device=x.device)

After (transformers v5 style):

    def __init__(self, config: Qwen2VLConfig, device=None):
        super().__init__()
        self.config = config
        self.rope_type = self.config.rope_parameters["rope_type"]
        rope_init_fn: Callable = self.compute_default_rope_parameters
        if self.rope_type != "default":
            rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
        inv_freq, self.attention_scaling = rope_init_fn(self.config, device)
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self.original_inv_freq = inv_freq

    @staticmethod
    def compute_default_rope_parameters(
        config: Optional[Qwen2VLConfig] = None,
        device: Optional["torch.device"] = None,
        seq_len: Optional[int] = None,
    ) -> tuple["torch.Tensor", float]:
        """
        Computes the inverse frequencies according to the original RoPE implementation
        Args:
            config ([`~transformers.PreTrainedConfig`]):
                The model configuration.
            device (`torch.device`):
                The device to use for initialization of the inverse frequencies.
            seq_len (`int`, *optional*):
                The current sequence length. Unused for this type of RoPE.
        Returns:
            Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
            post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
        """
        base = config.rope_parameters["rope_theta"]
        dim = getattr(config, "head_dim", None) or config.hidden_size // config.num_attention_heads

        attention_factor = 1.0  # Unused in this type of RoPE

        # Compute the inverse frequencies
        inv_freq = 1.0 / (
            base ** (torch.arange(0, dim, 2, dtype=torch.int64).to(device=device, dtype=torch.float) / dim)
        )
        return inv_freq, attention_factor
Similar to other models in this PR, the refactoring of Qwen2VLRotaryEmbedding has removed the _dynamic_frequency_update method and its call within forward. This disables the dynamic RoPE scaling feature for this model. This could be a regression if the feature is still intended to be supported. Please consider re-implementing this functionality in a way that is compatible with the new transformers API and the model's custom 3D position_ids.
The WAN model also employs RoPE; consider refactoring it in tandem to ensure consistency and reduce code duplication.
Could you please update the review with the precision alignment results after the refactoring? This will help ensure consistency and correctness in the redesigned implementation.
Sure, I'll set up a comparison experiment later. |
The RoPE in the WAN model is not imported from the transformers library, so it is not affected by transformers version upgrades.

I think this PR is ready for review. @FoolPlayer @Luosuu @Coach257 @Crystal-jiang
It appears that VeOmni has developed a new approach for adapting to transformers v5, so we will temporarily close this PR.
Thanks for your PR! Yes, we plan to adapt to v5 like #392.
What does this PR do?
I noticed that VeOmni plans to update transformers to v5: #268
Coincidentally, while recently conducting SFT tests for Qwen3 in VeOmni, I discovered some breaking changes between Transformers v4 and v5. This PR addresses a subset of these issues.
Checklist Before Starting
- The PR title is formatted as `[{modules}] {type}: {description}` (this will be checked by the CI).
  - `{modules}` include `misc`, `ci`, `config`, `docs`, `data`, `dist`, `omni`, `logging`, `model`, `optim`, `ckpt`, `release`, `task`, `perf`, `ops`, `parallel`, like `[ci, data, model]`.
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
  - If the PR contains breaking changes, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][parallel, model] feat: dynamic batching`.

Test
API and Usage Example
Design & Code Changes
- `is_safetensors_available()` has been removed in transformers v5, and since `safetensors` is now a required dependency, the conditional check code has been removed.
- `rope_scaling` has been removed and replaced with `rope_parameters` (a version-tolerant accessor is sketched below).
- The default `rope_type` initialization function has been removed, with its logic moved to the `compute_default_rope_parameters()` function in each respective modeling file.
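To illustrate the `rope_scaling` to `rope_parameters` change, a version-tolerant accessor might look roughly like the sketch below; the helper name `get_rope_type` is made up for illustration, and the fallback keys simply mirror the v4 convention described above:

```python
from typing import Any


def get_rope_type(config: Any) -> str:
    """Sketch: resolve the RoPE type from either a v5-style or v4-style config."""
    rope_parameters = getattr(config, "rope_parameters", None)
    if rope_parameters is not None:  # transformers v5: settings live under `rope_parameters`
        return rope_parameters["rope_type"]
    rope_scaling = getattr(config, "rope_scaling", None)
    if rope_scaling is not None:  # transformers v4: settings live under `rope_scaling` ("type" is the legacy key)
        return rope_scaling.get("rope_type", rope_scaling.get("type", "default"))
    return "default"
```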
Checklist Before Submitting

Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`