[feat /train] Add a one-click launch shell script for multi-node, multi-GPU distributed training #7333
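Summary of Changes
Hello @liangxiaoyun, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new feature that greatly simplifies setting up and running multi-node, multi-GPU distributed training for large language models by providing a one-click shell script. It provides comprehensive documentation as well as dedicated training scripts for the Qwen3-32B and Qwen2.5-72B models, aiming to streamline the training workflow and improve the user experience.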
Code Review
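This PR adds shell scripts for one-click launch of multi-node, multi-GPU distributed training, which is a very useful feature.
The code review found several areas for improvement, mainly in the scripts' maintainability, portability, and security:
- Code duplication: The Qwen2_5_72B_agent_sft.sh and Qwen3_32B_agent_sft.sh scripts are largely identical and differ in only a few parameters, which makes them hard to maintain. Consider merging them into one generic launch script that takes model-specific configuration (such as model path and batch size) from command-line arguments or a configuration file.
- Hardcoded paths: The scripts contain many user-specific hardcoded paths (for example /data_large_v2/liangxiaoyun/...), so they cannot be used directly in other environments or by other users. Consider passing these paths as script arguments or reading them from a configuration file.
- Security: Both the scripts and the documentation use the --ssh_password parameter to pass a password, which is a serious security risk (the password appears in the process list and in shell history). SSH key authentication is strongly recommended instead; it is both more secure and more convenient.
- Documentation: Echo_README.md contains some incorrect Markdown syntax, and its JSON format example is invalid, which may mislead users.
Specific suggestions are given in the code comments. Fixing these issues will greatly improve the quality and usability of the scripts.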
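The duplication and hardcoded-path points above could be addressed along the following lines. This is only a minimal sketch, not the PR's code: the function name parse_model_args and the flags --model_path and --load_path are hypothetical, and the variable names simply mirror MODEL_PATH and LOAD_PATH from the scripts under review.

```shell
#!/bin/bash
# Hypothetical sketch: one launcher shared by every model.
# Defaults come from the environment, so no user-specific path is baked in.
MODEL_PATH="${MODEL_PATH:-}"
LOAD_PATH="${LOAD_PATH:-}"

# Parse "--flag value" pairs; every flag in this sketch takes exactly one value.
parse_model_args() {
    while [ "$#" -gt 0 ]; do
        if [ "$#" -lt 2 ]; then
            echo "missing value for $1" >&2
            return 1
        fi
        case "$1" in
            --model_path) MODEL_PATH="$2" ;;
            --load_path)  LOAD_PATH="$2" ;;
            *) echo "unknown argument: $1" >&2; return 1 ;;
        esac
        shift 2
    done
    # Fail early instead of silently training with an empty path.
    if [ -z "$MODEL_PATH" ] || [ -z "$LOAD_PATH" ]; then
        echo "both --model_path and --load_path are required" >&2
        return 1
    fi
}
```

A per-model wrapper would then shrink to a single call that passes its own two paths, or could export MODEL_PATH and LOAD_PATH before invoking the shared launcher.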
| MODEL_PATH="/data_large_v2/liangxiaoyun/model_output/Qwen2.5-72B-Instruct" | ||
| LOAD_PATH="/data_large_v2/liangxiaoyun/model_output/Qwen2.5-72B-Instruct-megatron" |
| #!/bin/bash | ||
|
|
||
| # ========== Environment configuration ========== | ||
| export PYTHONPATH=\$PYTHONPATH:/data_large_v2/liangxiaoyun/projects/Megatron-LM |
| @@ -0,0 +1,555 @@ | |||
| #!/bin/bash | |||
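| | Parameter | Required | Default | Description | | ||
| |------|----------|--------|------| | ||
| | `--ips` | Yes | - | List of node IP addresses for distributed training; separate multiple IPs with commas | | ||
| | `--ssh_password` | Yes | - | Account password for the distributed training nodes | |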
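| echo "Required parameters:" | ||
| echo " --ips <string> Comma-separated list of node IPs (specify either this or --ip_file)" | ||
| echo " --ip_file <file> File listing node IPs, one per line (specify either this or --ips)" | ||
| echo " --ssh_password <string> Node account password" |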
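The --ssh_password flag in this help text is the one the review summary flags as a security risk. A key-based alternative might look like the sketch below. It assumes a public key has already been installed on every node (for example with ssh-copy-id); the names SSH_USER, SSH_KEY, split_ips, and run_on_node are hypothetical and do not come from the PR.

```shell
#!/bin/bash
# Hypothetical sketch: key-based SSH instead of passing --ssh_password.
SSH_USER="${SSH_USER:-$(id -un)}"
SSH_KEY="${SSH_KEY:-$HOME/.ssh/id_ed25519}"

# Print each IP of a comma-separated list (the format --ips accepts) on its own line.
split_ips() {
    printf '%s\n' "$1" | tr ',' '\n'
}

# Run a command on one node.
# BatchMode=yes makes ssh fail instead of prompting, so a missing key is
# reported as an error rather than hanging on a password prompt.
run_on_node() {
    node_ip="$1"
    shift
    ssh -i "$SSH_KEY" -o BatchMode=yes -o ConnectTimeout=10 \
        "${SSH_USER}@${node_ip}" "$@"
}
```

With this approach no secret appears in the process list or in shell history; a launcher could loop over the output of split_ips and call run_on_node for each address.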
|
|
||
|
|
||
| ## Installation | ||
| See the wiki document: 【Megatron-swift environment setup】(https://iwiki.woa.com/p/4016971017) |
| { | ||
| "id": str "数据唯一ID", | ||
| "tools": List[Dict[str, Any]] 工具列表, | ||
| "messages": List[Dict[str, str]] 对话列表, | ||
| } |
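The JSON example provided here is malformed: it mixes Python type hints with Chinese descriptions and cannot be parsed. This may mislead users who need to prepare data. Consider providing a real, valid JSON data example to help users understand the required data structure.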
| { | |
| "id": str "数据唯一ID", | |
| "tools": List[Dict[str, Any]] 工具列表, | |
| "messages": List[Dict[str, str]] 对话列表, | |
| } | |
| { | |
| "id": "some_unique_id", | |
| "tools": [{"name": "example_tool"}], | |
| "messages": [{"role": "user", "content": "Hello!"}] | |
| } |
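One way to confirm that a data sample such as the suggested one actually parses is to run it through a JSON parser before training. The sketch below is an illustration only: validate_json is a hypothetical helper, and it assumes python3 is available on PATH so that the standard json.tool module can do the parsing.

```shell
#!/bin/bash
# Hypothetical helper; assumes python3 is on PATH.
# Returns 0 when the file holds one parseable JSON document, non-zero otherwise.
validate_json() {
    python3 -m json.tool "$1" > /dev/null 2>&1
}

# Write the suggested sample to a temporary file and check it.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
{
  "id": "some_unique_id",
  "tools": [{"name": "example_tool"}],
  "messages": [{"role": "user", "content": "Hello!"}]
}
EOF

if validate_json "$sample"; then
    echo "sample parses as JSON"
else
    echo "sample is not valid JSON" >&2
fi
rm -f "$sample"
```

This checks syntax only; it does not verify that the id, tools, and messages fields carry the types the training code expects.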
| # | ||
| # ============================================================ | ||
|
|
||
| # set -e |
| echo " --pipeline_model_parallel_size <int> Number of training epochs (default: ${PP})" | ||
| echo " --context_parallel_size <int> Number of training epochs (default: ${CP})" |
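The description text for these two parameters, "训练轮数" ("number of training epochs"), is wrong; it was presumably copied and pasted from the --epochs parameter. Please correct it to the proper description for each parameter to avoid confusing users.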
| echo " --pipeline_model_parallel_size <int> Number of training epochs (default: ${PP})" | |
| echo " --context_parallel_size <int> Number of training epochs (default: ${CP})" | |
| echo " --pipeline_model_parallel_size <int> Pipeline parallel size (default: ${PP})" | |
| echo " --context_parallel_size <int> Context parallel size (default: ${CP})" |
| sleep 2 | ||
|
|
||
| # Other configuration | ||
| export TF_CPP_MIN_LOG_LEVEL=3 |
PR type
PR information
Write the detailed information that belongs to this PR.
Experiment results
Paste your experiment results here (if needed).