Add TP support for Qwen3-Dense model online training. #77
yubofredwang merged 22 commits into sgl-project:main
Conversation
Summary of Changes
Hello @Ximingwang-09, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request expands our system's capability by integrating new configuration files for specific Qwen3 model variants. These additions are crucial for enabling the proper loading and utilization of the Qwen3 32B Eagle3 and Qwen3 4B Eagle3 models within our framework, providing their essential architectural parameters.
Highlights
- New Model Configurations: I've added two new JSON configuration files for Qwen3 models, specifically the 'Eagle3' variants.
- Qwen3 32B Eagle3 Support: A configuration file (configs/qwen3-32b-eagle3.json) has been introduced to define the parameters for the Qwen3 32B Eagle3 model, including its architecture (LlamaForCausalLMEagle3), hidden size, attention heads, and vocabulary size.
- Qwen3 4B Eagle3 Support: A separate configuration file (configs/qwen3-4b-eagle3.json) has been added to specify the parameters for the Qwen3 4B Eagle3 model, detailing its architecture, hidden size, attention heads, and vocabulary size.
Code Review
This pull request adds two new configuration files for qwen3-32b-eagle3 and qwen3-4b-eagle3 models. The configurations look good, but I've identified a few minor formatting and consistency issues that should be addressed to improve maintainability and prevent potential issues with tooling. Specifically, both files contain a UTF-8 BOM, are missing a final newline character, and use an integer for rope_theta where a float would be more consistent with other configurations in the project. I've provided suggestions to fix these.
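The BOM and missing-newline problems called out above can be checked mechanically before committing config files. A minimal sketch (the helper names are mine, not part of this PR):

```python
# Check raw file bytes for a UTF-8 BOM and a missing final newline,
# the two formatting issues flagged in this review.

def check_config_bytes(raw: bytes) -> list:
    """Return a list of formatting problems found in the raw file bytes."""
    problems = []
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 BOM")
    if raw and not raw.endswith(b"\n"):
        problems.append("file is missing a final newline")
    return problems

def fix_config_bytes(raw: bytes) -> bytes:
    """Strip a leading BOM and ensure the file ends with a newline."""
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]
    if raw and not raw.endswith(b"\n"):
        raw += b"\n"
    return raw
```

Running `check_config_bytes(open("configs/qwen3-32b-eagle3.json", "rb").read())` on the files as submitted would report both issues.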
configs/qwen3-32b-eagle3.json (outdated, new file):

    {
    ...
    "num_key_value_heads": 8,
    "rms_norm_eps": 1e-06,
    "rope_scaling": null,
    "rope_theta": 1000000,
For consistency with other model configurations in the repository (e.g., qwen3-235B-A22B-eagle3.json), it's better to represent rope_theta as a float. While this may not cause issues with the current implementation, standardizing on the float type improves maintainability and readability, especially since this value is used in floating-point calculations.
"rope_theta": 1000000.0,
configs/qwen3-32b-eagle3.json (outdated):

    "use_sliding_window": false,
    "vocab_size": 151936,
    "draft_vocab_size": 32000
    }
(no newline at end of file)
configs/qwen3-4b-eagle3.json (outdated, new file):

    {
    ...
    "num_key_value_heads": 8,
    "rms_norm_eps": 1e-06,
    "rope_scaling": null,
    "rope_theta": 1000000,
As with the 32B config above, rope_theta should be represented as a float for consistency with other model configurations in the repository:
"rope_theta": 1000000.0,
configs/qwen3-4b-eagle3.json (outdated):

    "use_sliding_window": false,
    "vocab_size": 151936,
    "draft_vocab_size": 32000
    }
(no newline at end of file)
Have you tried training? Can you share the results, and could you please provide the training script?
Yes, I'm trying to train a qwen3-32b eagle model. I'll share the results and the script after I finish.
/gemini review
Code Review
This pull request adds support for Qwen3 dense models, including new configuration files and a model implementation with tensor parallelism. The changes are well-structured, but I've identified a couple of critical issues in the tensor parallelism logic within the new Qwen3Attention module that could lead to incorrect model outputs or runtime errors. Addressing these is important for the correctness and robustness of the implementation.
    if self.config._attn_implementation != "eager":
        attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]

    attn_output, attn_weights = attention_interface(
This line will cause a ValueError at runtime if self.config._attn_implementation is set to an implementation (like flash_attention_2) that returns more than two values (e.g., when use_cache=True). The code currently unpacks exactly two return values, but some attention backends return a third value (past_key_value).
To fix this, you should handle the variable number of return values from attention_interface.
A second, related issue is that the function's return type hint on line 207 is incorrect. It should be updated to match the actual return values, which are (attn_output, attn_weights).
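One way to make the call site tolerant of backends that return either two or three values is star-unpacking, so any extra return value (such as a cache entry) is absorbed. A sketch, not the PR's actual code; `attention_interface` and the stubs here are stand-ins:

```python
# Some attention backends return (attn_output, attn_weights); others also
# return a cache entry, e.g. (attn_output, attn_weights, past_key_value).
# Star-unpacking absorbs any extra return values without a ValueError.

def call_attention(attention_interface, *args, **kwargs):
    outputs = attention_interface(*args, **kwargs)
    attn_output, attn_weights, *rest = outputs  # rest may hold past_key_value
    return attn_output, attn_weights

# Stub backends standing in for entries of ALL_ATTENTION_FUNCTIONS:
def eager_stub():
    return "out", "weights"

def flash_stub():
    return "out", None, "past_kv"
```

With this pattern, `call_attention(eager_stub)` and `call_attention(flash_stub)` both succeed, whereas a fixed two-value unpack would raise on the three-value backend.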
    attn_output = self.o_proj(attn_output)
    # Add all_reduce for TP
    dist.all_reduce(attn_output, op=dist.ReduceOp.SUM, group=self.tp_group)
There is a critical issue here with how the bias of the output projection (o_proj) is handled in tensor parallelism. The o_proj is a RowParallelLinear layer, and if it has a bias (i.e., attention_bias=True in the config), the bias is added to the output on each TP rank before the all_reduce operation. This results in the bias being incorrectly scaled by the tensor parallel world size (tp_size).
While the configs in this PR set attention_bias: false, the code should be robust for cases where it is true (which is the default in transformers.Qwen3Config). The bias should be added after the all_reduce operation.
Suggested change:

    - attn_output = self.o_proj(attn_output)
    - # Add all_reduce for TP
    - dist.all_reduce(attn_output, op=dist.ReduceOp.SUM, group=self.tp_group)
    + attn_output = torch.nn.functional.linear(attn_output, self.o_proj.weight)
    + dist.all_reduce(attn_output, op=dist.ReduceOp.SUM, group=self.tp_group)
    + if self.o_proj.bias is not None:
    +     attn_output += self.o_proj.bias
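The scaling problem is easy to see with plain numbers: if each of `tp_size` ranks adds the bias to its partial result before the all-reduce sum, the reduced output contains the bias `tp_size` times. A torch-free sketch with two ranks (all values here are illustrative):

```python
# Two TP ranks each hold one partial sum of a row-parallel matmul.
# sum(partial) is the correct (un-biased) projection output.
tp_size = 2
partial = [3.0, 5.0]   # per-rank partial sums of x @ W_i
bias = 1.0

# Wrong: bias added on each rank before the all-reduce (sum).
wrong = sum(p + bias for p in partial)   # bias counted tp_size times

# Right: reduce the partials first, add the bias once afterwards.
right = sum(partial) + bias

assert wrong == 10.0                     # 8 + tp_size * bias
assert right == 9.0                      # 8 + bias
assert wrong - right == (tp_size - 1) * bias
```

The difference grows linearly with `tp_size`, which is why the suggested fix defers the bias until after `dist.all_reduce`.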
configs/qwen3-32b-eagle3.json (outdated):

    "use_sliding_window": false,
    "vocab_size": 151936,
    "draft_vocab_size": 32000
    }
(no newline at end of file)
configs/qwen3-4b-eagle3.json (outdated):

    "use_sliding_window": false,
    "vocab_size": 151936,
    "draft_vocab_size": 32000
    }
(no newline at end of file)
I found that the current project doesn't support TP for Qwen dense models, which may lead to OOM, so I implemented it briefly. Training has now started successfully, and I will share the training results as soon as they are available.
I validated by training the Qwen3-4B model with 2000 ShareGPT samples on 8 H20 GPUs. The command is above, and the results are below:
@sleepcoo I think it's ready. Can you help me review this? Thanks.
Merged commits:

* add more qwen config
* support qwen3 dense tp
* fix
* fix
* fix
* lint
* lint fix
* fix lint
* fix lint
* lint
* fix
* fix
* fix
* clean
* lint fix
* align head_dim with qwen config

Co-authored-by: 纬杭 <ximing.wxm@antgroup.com>
Add TP support for Qwen3-Dense model online training, and more Qwen3 model config support.
Related Issue: #85

Script: run_qwen3_dense_eagle3_online.sh

cmd:

    bash run_qwen3_dense_eagle3_online.sh 8