
Launching a distributed job with msrun in mindnlp seems to run the same model and code independently on every card? #2186

@YiYi97

Description


Is your feature request related to a problem? Please describe.
For a minimal task of fine-tuning BERT on the SST-2 dataset, I launch the job with: msrun --worker_num=8 --local_worker_num=8 --master_port=8118 --join=True bert_classify.py. Looking at the worker logs, the epoch loss differs from worker to worker, and every worker runs 1053/1053 training steps (67349/32), i.e. each card iterates over the entire training set, so there is no distributed speedup at all.

Describe the solution you'd like
Ideally, an eight-card fine-tuning job launched with msrun should train with data parallelism across the eight cards, rather than each card training independently.
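For scale: with the 1053 steps per epoch reported above, a proper 8-way data-parallel split of the training set should leave each worker roughly 1053 / 8 ≈ 132 steps per epoch, with every card processing a different shard while gradients are averaged across cards at each step.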

Describe alternatives you've considered
1. I tried initializing distributed communication and data parallelism with the code below, but the communication synchronization made training even slower, and data parallelism still did not take effect: each card still trained on its own.
import mindspore as ms
from mindspore import context
from mindspore.communication import init

# Graph mode on Ascend; initialize HCCL, then enable data-parallel mode with gradient averaging.
# Note: this only configures gradient synchronization, it does not shard the dataset by itself.
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
init()
ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True)
2. Following the official documentation, I tried sharding the dataset myself. During training each card did process only 1/8 of the data, but the final results could not be synchronized across cards.
from mindspore.communication import get_rank, get_group_size
rank_id, rank_size = get_rank(), get_group_size()  # per-worker shard index / total shard count (requires init() first)
train_dataset = GeneratorDataset(source=train_dataset, column_names=['text','label','task'], shuffle=False, num_shards=rank_size, shard_id=rank_id)
eval_dataset = GeneratorDataset(source=eval_dataset, column_names=['text','label','task'], shuffle=False, num_shards=rank_size, shard_id=rank_id)
test_dataset = GeneratorDataset(source=test_dataset, column_names=['text','label','task'], shuffle=False, num_shards=rank_size, shard_id=rank_id)

Additional context
The code is attached. Because mindnlp's Trainer encapsulates the whole training loop, I have not tried rewriting it to add a Reduce (gradient all-reduce) step myself; a rough sketch of what that might look like is included after the attachment. Could someone please advise how to launch a distributed job with mindnlp so that training is actually accelerated?
bert_classify.py
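For reference, below is a minimal sketch of what a manual data-parallel training step could look like outside the mindnlp Trainer, combining the communication init and dataset sharding above with gradient averaging via mindspore.nn.DistributedGradReducer. It is a sketch under assumptions, not a confirmed mindnlp API: names such as model, loss_fn, optimizer and train_dataset stand in for the objects built in bert_classify.py, and the data pipeline is simplified to two columns.

import mindspore as ms
from mindspore import nn
from mindspore.communication import init, get_group_size

init()
ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True)

# Averages gradients across all cards each step so the weights stay identical on every worker.
grad_reducer = nn.DistributedGradReducer(optimizer.parameters, mean=True, degree=get_group_size())

def forward_fn(input_ids, labels):
    logits = model(input_ids)
    return loss_fn(logits, labels)

grad_fn = ms.value_and_grad(forward_fn, None, optimizer.parameters)

for input_ids, labels in train_dataset.create_tuple_iterator():
    loss, grads = grad_fn(input_ids, labels)
    grads = grad_reducer(grads)  # all-reduce + mean over the 8 workers
    optimizer(grads)

With gradients averaged every step, the per-card 1/8 shards train one consistent model, so the divergent epoch losses seen in the worker logs should disappear. Whether the same effect can be obtained purely through mindnlp Trainer configuration is exactly the question raised above.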
