[fsdp, model] feat: add sp for qwen3.5 fsdp grpo training #5920
Zhang1Sheng wants to merge 1 commit into verl-project:main
Conversation
Code Review
This pull request implements support for Ulysses sequence parallelism (SP) in Qwen3.5 models by patching the Gated DeltaNet forward pass and the attention-mask application logic. The changes include sharding depthwise convolution weights, implementing all-to-all communication for the linear-attention heads, and slicing parameters like `A_log` and `dt_bias` to align with local ranks. Review feedback identified critical shape-mismatch issues in the patched forward pass, specifically the use of the unpatched mask function and incorrect tensor dimensions in both the convolution-update and convolution-forward paths.
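The all-to-all communication mentioned above refers to the Ulysses pattern: before linear attention, each rank trades its sequence shard of every head group for the full sequence of its own head group. A minimal sketch of that reshard (hypothetical helper name and shapes, not the PR's actual code), assuming the head count divides evenly by the SP world size:

```python
import torch
import torch.distributed as dist


def seq_shard_to_head_shard(x: torch.Tensor, sp_group) -> torch.Tensor:
    """Ulysses-style reshard: [B, S/P, H, d] (sequence-sharded)
    -> [B, S, H/P, d] (head-sharded), where P is the SP world size."""
    p = dist.get_world_size(group=sp_group)
    b, s_local, h, d = x.shape
    # Split the heads into P groups, one destination rank per group.
    x = x.reshape(b, s_local, p, h // p, d).permute(2, 0, 1, 3, 4).contiguous()
    out = torch.empty_like(x)
    # Rank r sends head group i to rank i and receives rank i's sequence
    # shard of head group r; dim 0 of `out` indexes shards in rank order.
    dist.all_to_all_single(out, x, group=sp_group)
    # Concatenate the gathered sequence shards along the sequence dim.
    return out.permute(1, 0, 2, 3, 4).reshape(b, p * s_local, h // p, d)
```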
```python
        cache_params: Qwen3_5DynamicCache | None = None,
        attention_mask: torch.Tensor | None = None,
    ):
        hidden_states = apply_mask_to_padding_states(hidden_states, attention_mask)
```
The call to `apply_mask_to_padding_states` uses the original function imported from transformers, which does not handle sequence-parallel slicing of the attention mask. This will lead to a shape-mismatch error when `ulysses_sp_size > 1` because `hidden_states` is sharded but `attention_mask` is not. Use the patched `qwen3_5_apply_mask_to_padding_states` defined in this file instead.
```diff
-        hidden_states = apply_mask_to_padding_states(hidden_states, attention_mask)
+        hidden_states = qwen3_5_apply_mask_to_padding_states(hidden_states, attention_mask)
```
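For reference, the patched helper would slice the full-sequence mask down to the local shard before applying it. A rough sketch under assumed semantics (the `sp_rank`/`sp_size` arguments are illustrative; the PR's actual signature may differ):

```python
import torch


def qwen3_5_apply_mask_to_padding_states(
    hidden_states: torch.Tensor,  # [B, S / sp_size, D], this rank's sequence shard
    attention_mask: torch.Tensor | None,  # [B, S], full-sequence padding mask
    sp_rank: int = 0,
    sp_size: int = 1,
) -> torch.Tensor:
    """Zero out padding positions, slicing the mask to the local Ulysses shard."""
    if attention_mask is None or attention_mask.shape[1] <= 1:
        return hidden_states
    s_local = hidden_states.shape[1]
    local_mask = attention_mask[:, sp_rank * s_local : (sp_rank + 1) * s_local]
    return (hidden_states * local_mask[:, :, None]).to(hidden_states.dtype)
```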
```python
        mixed_qkv = self.causal_conv1d_update(
            mixed_qkv,
            conv_state,
            self.conv1d.weight.squeeze(1),
            self.conv1d.bias,
            self.activation,
        )
```
When `seq_len == 1`, `mixed_qkv` has shape `[B, 1, D]`. However, `causal_conv1d_update` expects a 2D tensor of shape `[B, D]`. Additionally, the result must be unsqueezed to `[B, D, 1]` so that the subsequent `transpose(1, 2)` at line 422 correctly restores the `[B, 1, D]` shape required for the split operation at line 424.
```diff
-        mixed_qkv = self.causal_conv1d_update(
-            mixed_qkv,
+        mixed_qkv = self.causal_conv1d_update(
+            mixed_qkv.squeeze(1),
             conv_state,
             self.conv1d.weight.squeeze(1),
             self.conv1d.bias,
             self.activation,
-        )
+        ).unsqueeze(-1)
```
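To make the suggested shape flow concrete, here is the decode-step round trip with a stand-in for the fused kernel (example sizes, not the model's real dimensions):

```python
import torch

b, d = 2, 128                     # example batch size and conv channel count
mixed_qkv = torch.randn(b, 1, d)  # decode step: seq_len == 1

x = mixed_qkv.squeeze(1)          # [B, D], the 2D layout causal_conv1d_update expects
y = x                             # stand-in for causal_conv1d_update(x, conv_state, ...)
y = y.unsqueeze(-1)               # [B, D, 1]
y = y.transpose(1, 2)             # [B, 1, D], ready for the qkv split
assert y.shape == mixed_qkv.shape
```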
```python
        mixed_qkv = self.causal_conv1d_fn(
            x=mixed_qkv,
            weight=conv_weight,
            bias=self.conv1d.bias,
            activation=self.activation,
            seq_idx=None,
        )
```
`causal_conv1d_fn` expects the input tensor `x` to have the channel dimension second (shape `[B, D, S]`). Since the transpose at line 331 was removed, `mixed_qkv` is currently `[B, S, D]`. It must be transposed before calling the convolution function; the result will be `[B, D, S]`, which is then correctly handled by the `transpose(1, 2)` at line 422.
```diff
         mixed_qkv = self.causal_conv1d_fn(
-            x=mixed_qkv,
+            x=mixed_qkv.transpose(1, 2),
             weight=conv_weight,
             bias=self.conv1d.bias,
             activation=self.activation,
             seq_idx=None,
         )
```
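The layout requirement can be checked with plain PyTorch, since a causal depthwise conv is `F.conv1d` with left padding over a `[B, D, S]` input (example sizes; `F.conv1d` stands in here for the fused `causal_conv1d_fn`):

```python
import torch
import torch.nn.functional as F

b, s, d, k = 2, 16, 128, 4       # example sizes; k is the conv kernel width
mixed_qkv = torch.randn(b, s, d)  # [B, S, D], now that the transpose was removed
conv_weight = torch.randn(d, k)   # depthwise weight after squeeze(1) / sharding

x = mixed_qkv.transpose(1, 2)     # [B, D, S]: channels-second, as the kernel expects
# Left-pad by k - 1 so the conv is causal; groups=d makes it depthwise.
y = F.conv1d(F.pad(x, (k - 1, 0)), conv_weight[:, None, :], groups=d)  # [B, D, S]
out = y.transpose(1, 2)           # back to [B, S, D] for the qkv split
assert out.shape == mixed_qkv.shape
```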
What does this PR do?
Checklist Before Starting
- Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`
  - If this PR involves multiple modules, separate them like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API, prepend `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
API and Usage Example
# Add code snippet or script demonstrating how to use this

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If your PR changes the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.