Description
Is your feature request related to a problem? Please describe.
A minimal task: fine-tuning BERT on the SST-2 dataset. I launch the job with: msrun --worker_num=8 --local_worker_num=8 --master_port=8118 --join=True bert_classify.py. Looking at the worker logs, the epoch loss differs across workers, and every worker runs 1053/1053 training steps (67349/32), which means each worker iterates over the entire training set and there is no distributed speedup at all.
Describe the solution you'd like
Ideally, launching the eight-card fine-tuning job with msrun should train with data parallelism across the eight cards, rather than each card training on the full dataset independently.
Describe alternatives you've considered
1. I tried the following code to initialize distributed communication and data parallelism, but training actually became slower because of the communication synchronization, and data parallelism still did not take effect: each card still trained on its own.
import mindspore as ms
from mindspore import context
from mindspore.communication import init
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
init()  # initialize collective communication before configuring the parallel context
ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True)
2. Following the official documentation, I tried sharding the dataset myself. During training each card indeed only processed 1/8 of the data, but the final results could not be synchronized across the cards.
train_dataset = GeneratorDataset(source=train_dataset, column_names=['text','label','task'], shuffle=False, num_shards=rank_size, shard_id=rank_id)
eval_dataset = GeneratorDataset(source=eval_dataset, column_names=['text','label','task'], shuffle=False, num_shards=rank_size, shard_id=rank_id)
test_dataset = GeneratorDataset(source=test_dataset, column_names=['text','label','task'], shuffle=False, num_shards=rank_size, shard_id=rank_id)
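For reference, rank_size and rank_id in the snippet above come from the communication module; a minimal sketch, assuming init() has already been called:
from mindspore.communication import get_rank, get_group_size

rank_id = get_rank()          # this worker's index, 0..7 for an 8-card job
rank_size = get_group_size()  # total number of workers, 8 here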
Additional context
The code is attached. Since mindnlp's Trainer encapsulates the whole training loop, I have not tried to rewrite it to add a reduce mechanism myself. So how exactly should a distributed job be launched under mindnlp to actually speed up training? Any guidance would be much appreciated.
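For reference, this is roughly the kind of reduce step I had in mind but have not wired into the mindnlp Trainer; a minimal sketch only, assuming init() has been called, with AllReduceMean and shard_accuracy as illustrative names:
import mindspore as ms
from mindspore import nn, ops, Tensor
from mindspore.communication import get_group_size

class AllReduceMean(nn.Cell):
    # Sum a value over all cards, then divide by the number of cards.
    def __init__(self):
        super().__init__()
        self.all_reduce = ops.AllReduce(ops.ReduceOp.SUM)
        self.group_size = get_group_size()

    def construct(self, x):
        return self.all_reduce(x) / self.group_size

# e.g. averaging a per-shard metric after evaluation:
# avg_accuracy = AllReduceMean()(Tensor([shard_accuracy], ms.float32))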
bert_classify.py