feat(dsv3):Runnable N1C8 configs #2525
base: develop
Conversation
Thanks for your contribution!
Force-pushed from a1c5bb4 to 1b6c1f4
$LAUNCH_CMD \
    --run_mode=collective \
    ${script:-run_pretrain.py} \
    $@
Please write a document covering dataset preparation, model download (e.g. converting FP8 weights to BF16), and how to launch training.
paddleformers/trainer/trainer.py
Outdated
    RowParallelQuantizationLinear,
)

try:
No need for a try here; just keep the original logic.
# if pp_first, the order = ["dp", "pp", "moe_sharding", "sharding", "sep", "ep", "mp"]
# if sharding_first, the order is ["dp", "moe_sharding", "sharding", "pp", "sep", "ep", "mp"]
order.insert(sd_idx, "moe_sharding")
if not os.getenv("DSV3_FAST_PRETRAIN", "False"):
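The two comment lines in the diff above can be checked with a small self-contained sketch; the starting order and the value of `sd_idx` are assumptions reconstructed from those comments, not code from the PR:

```python
# Reconstruct the pp_first case described in the diff's comments.
# Starting order and sd_idx are assumptions based on those comments.
order = ["dp", "pp", "sharding", "sep", "ep", "mp"]
sd_idx = order.index("sharding")

# Inserting "moe_sharding" just before "sharding" yields the documented
# pp_first order.
order.insert(sd_idx, "moe_sharding")
print(order)  # ['dp', 'pp', 'moe_sharding', 'sharding', 'sep', 'ep', 'mp']
```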
This handling is too hacky. Is there a more sensible way to pass this in, with a proper parameter name, instead of an environment variable?
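Beyond style, there is a concrete hazard in the exact line shown in the diff: `os.getenv` returns a string, and the default `"False"` is a non-empty string, so truthiness checks on it do not behave like a boolean flag. A minimal demonstration:

```python
import os

# os.getenv returns the string "False" when the variable is unset;
# any non-empty string is truthy, so `not os.getenv(...)` is always False.
os.environ.pop("DSV3_FAST_PRETRAIN", None)
flag = os.getenv("DSV3_FAST_PRETRAIN", "False")
print(bool(flag))  # True  -- the string "False" is truthy
print(not flag)    # False -- a branch guarded by `not flag` never runs

# Setting the variable to "False" explicitly changes nothing:
os.environ["DSV3_FAST_PRETRAIN"] = "False"
print(bool(os.getenv("DSV3_FAST_PRETRAIN", "False")))  # still True
```

An explicit boolean training argument avoids both the hidden global state and the string-truthiness trap.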
fleet.init(is_collective=True, strategy=strategy)
logger.info(strategy)

if os.getenv("DSV3_FAST_PRETRAIN", "False"):
Same as above.
# limitations under the License.


import json
This function is a DSV3-specific customization; suggest moving it into the example directory.
attention_dropout=0.0,
speculate_model_type=False,
using_flex_token=False,
use_dualpipev=False,
For the config, suggest also writing a DeepseekV2FastConfig(DeepseekV2Config) subclass, to keep the original class clean.
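A minimal sketch of the suggested subclass structure; the class and field names follow the diff, but the bodies here are illustrative stand-ins, not PaddleFormers code:

```python
# Illustrative only: the real DeepseekV2Config lives in PaddleFormers.
class DeepseekV2Config:
    def __init__(self, attention_dropout=0.0, **kwargs):
        self.attention_dropout = attention_dropout

# Fast-pretrain-only fields move into a subclass, keeping the base clean.
class DeepseekV2FastConfig(DeepseekV2Config):
    def __init__(self, using_flex_token=False, use_dualpipev=False, **kwargs):
        super().__init__(**kwargs)
        self.using_flex_token = using_flex_token
        self.use_dualpipev = use_dualpipev

cfg = DeepseekV2FastConfig(use_dualpipev=True, attention_dropout=0.1)
print(cfg.use_dualpipev, cfg.attention_dropout)  # True 0.1
```

Code that only needs the base fields keeps accepting `DeepseekV2Config`, and the fast pretrain path opts into the subclass.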
import contextlib
import math
import os
Put all non-generic modifications into the modeling_fast directory. Otherwise this will make it difficult for Zhonghui to migrate the SFT model architecture in, and will complicate our later refactoring of the shared modeling components.
import paddle.nn as nn
from paddle.distributed.fleet.meta_parallel import (
    LayerDesc,
    LocalSharedLayerDesc,
Don't modify the original modeling_pp.py; write a modeling_pp_fast.py instead.
async_finish=False,
allocate_on_comm_stream=False,
recv_x, recv_token_probs, states, event = fused_dispatch_forward_func(
    x, token_indices, token_probs, num_experts, group, previous_event
If all of these are dropped, won't that affect other modules that use them?
group_scores = (
    scores_for_choice.reshape([bsz_seq_len, self.n_group, -1]).topk(2, axis=-1)[0].sum(axis=-1)
)  # fmt:skip [n, n_group]
reshape_tmp_rst = scores_for_choice.reshape([bsz_seq_len, self.n_group, -1])
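To check whether splitting the reshape into a temporary changes semantics, the grouped top-2 scoring can be reproduced in NumPy (the PR uses paddle tensors; `topk(2, axis=-1)[0].sum(axis=-1)` is a sum of the two largest values per group, and the shapes here are illustrative):

```python
import numpy as np

# Illustrative shapes; the PR operates on paddle tensors instead.
bsz_seq_len, n_group, experts_per_group = 2, 4, 3
rng = np.random.default_rng(0)
scores_for_choice = rng.random((bsz_seq_len, n_group * experts_per_group))

# Original one-liner: reshape -> top-2 values per group -> sum.
reshaped = scores_for_choice.reshape(bsz_seq_len, n_group, -1)
group_scores = np.sort(reshaped, axis=-1)[..., -2:].sum(axis=-1)

# Refactored version: the reshape result is stored in a temporary first.
reshape_tmp_rst = scores_for_choice.reshape(bsz_seq_len, n_group, -1)
group_scores_2 = np.sort(reshape_tmp_rst, axis=-1)[..., -2:].sum(axis=-1)

print(np.allclose(group_scores, group_scores_2))  # True
print(group_scores.shape)  # (2, 4), i.e. [n, n_group]
```

Storing the reshape in a temporary is value-equivalent; any difference would come from framework-level behavior (e.g. recomputation or memory reuse), not the math.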
Do these modifications have any side effects?
DeepseekV3 pretraining currently introduces many new components into PaddleFormers, and they are not standardized enough. Suggest keeping the model architecture and new components in example for now, and merging them into PaddleFormers once the functionality is complete.
…SV3_USE_ATTEN_RECOMPUTE DSV3_USE_FP8_DISPATCH USE_DS_GEMM into config.json
…3) move load_hf_ckpt
Force-pushed from a8b9ba6 to d0f203f
@@ -0,0 +1,1678 @@
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
The fast file should no longer be needed here. The original fast file existed to hold non-shared modifications separately from modeling.py; now everything can live in modeling.py directly.
export FLAGS_large_pool_pre_alloc_in_mb=61440
export FLAGS_deep_ep_comm_prealloc_in_mb=1000

export DSV3_FAST_PRETRAIN=true
Delete this.
# mpirun sh script/kill_process.sh
# mpirun rm -rf output
nohup bash script/train_gpu.sh ./config/pretrain_argument.json --dsv3_fast_pretrain=True > run.log 2>&1 &
Put dsv3_fast_pretrain in the config instead.
The places that originally used DSV3_FAST_PRETRAIN are in the TrainingArguments construction in training_args.py, which happens before config.json is read, so the config contents are not yet available at that point.
)

tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name_or_path, download_hub="huggingface")
config = DeepseekV2FastConfig.from_pretrained("./config/config.json")
Don't hard-code the path here; put it in the config as before.
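A sketch of the suggested direction, assuming a hypothetical field on the existing model_args; the field name `config_name_or_path` is illustrative, not an existing PaddleFormers argument:

```python
from dataclasses import dataclass

# Hypothetical argument class standing in for the PR's model_args.
@dataclass
class ModelArguments:
    tokenizer_name_or_path: str = "deepseek-ai/DeepSeek-V3"
    config_name_or_path: str = "./config/config.json"  # hypothetical field

model_args = ModelArguments(config_name_or_path="./my_run/config.json")

# The call site reads the path from the arguments instead of a literal:
config_path = model_args.config_name_or_path
print(config_path)  # ./my_run/config.json
```

This keeps per-run paths out of the training script and in the launch configuration where the other paths already live.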
# Flags for best performance
export FLAGS_share_tensor_for_grad_tensor_holder=1
export FLAGS_use_default_stream=false
export =false
Delete this.
# if pp_first, the order = ["dp", "pp", "moe_sharding", "sharding", "sep", "ep", "mp"]
# if sharding_first, the order is ["dp", "moe_sharding", "sharding", "pp", "sep", "ep", "mp"]
order.insert(sd_idx, "moe_sharding")
if not self.dsv3_fast_pretrain:
What change motivated self.dsv3_fast_pretrain here? Suggest not using a specific model name as a switch. For example, if the switch is needed for dualpipe, name it something like apply_dual_pipe.
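A sketch of the renaming suggestion, assuming the switch really does gate dualpipe behavior; both field names here are illustrative:

```python
from dataclasses import dataclass

# Instead of a switch tied to one model's name...
@dataclass
class TrainingArgumentsBefore:
    dsv3_fast_pretrain: bool = False  # what it enables is not obvious

# ...name the flag after the behavior it enables, per the review comment.
@dataclass
class TrainingArgumentsAfter:
    apply_dual_pipe: bool = False  # hypothetical name from the suggestion

args = TrainingArgumentsAfter(apply_dual_pipe=True)
print(args.apply_dual_pipe)  # True
```

Behavior-named flags also stay reusable if a second model later needs the same scheduling change.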
@@ -0,0 +1,100 @@
# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
If it's needed for pretraining, this can be added back under the paddleformers/data directory.
We need a document covering the required hardware resources, CUDA/NCCL version requirements, how to install dependencies, pretraining dataset preparation, how to launch training, model weight merging (if needed), and how to convert the model back into inference-ready weights.
…ove dsv3 code into new directory
No description provided.