
Launching a distributed job with msrun in mindnlp seems to run the same model and code independently on every card? #2186

@YiYi97

Description


Is your feature request related to a problem? Please describe.
For a minimal task of fine-tuning BERT on the SST-2 dataset, I launch the job with: msrun --worker_num=8 --local_worker_num=8 --master_port=8118 --join=True bert_classify.py. Looking at the worker logs, the epoch loss differs from worker to worker, and every worker runs 1053/1053 training steps (67349/32), i.e. each card iterates over the entire training set, so there is no distributed speedup at all.

Describe the solution you'd like
Ideally, an eight-card fine-tuning job launched with msrun should train with data parallelism across the eight cards, rather than each card training independently.
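For scale: with the 1053 steps per epoch reported above, a proper 8-way data-parallel split of the training set should leave each worker roughly 1053 / 8 ≈ 132 steps per epoch, with every card processing a different shard while gradients are averaged across cards at each step.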

Describe alternatives you've considered
1. I tried initializing distributed communication and data parallelism with the code below, but the communication synchronization made training even slower, and data parallelism still did not take effect: each card still trained on its own.
import mindspore as ms
from mindspore import context
from mindspore.communication import init

# Graph mode on Ascend; initialize HCCL, then enable data-parallel mode with gradient averaging.
# Note: this only configures gradient synchronization, it does not shard the dataset by itself.
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
init()
ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True)
2. Following the official documentation, I tried sharding the dataset myself. During training each card did process only 1/8 of the data, but the final results could not be synchronized across cards.
from mindspore.communication import get_rank, get_group_size
rank_id, rank_size = get_rank(), get_group_size()  # per-worker shard index / total shard count (requires init() first)
train_dataset = GeneratorDataset(source=train_dataset, column_names=['text','label','task'], shuffle=False, num_shards=rank_size, shard_id=rank_id)
eval_dataset = GeneratorDataset(source=eval_dataset, column_names=['text','label','task'], shuffle=False, num_shards=rank_size, shard_id=rank_id)
test_dataset = GeneratorDataset(source=test_dataset, column_names=['text','label','task'], shuffle=False, num_shards=rank_size, shard_id=rank_id)

Additional context
The code is attached. Because mindnlp's Trainer encapsulates the whole training loop, I have not tried rewriting it to add a Reduce (gradient all-reduce) step myself; a rough sketch of what that might look like is included after the attachment. Could someone please advise how to launch a distributed job with mindnlp so that training is actually accelerated?
bert_classify.py
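For reference, below is a minimal sketch of what a manual data-parallel training step could look like outside the mindnlp Trainer, combining the communication init and dataset sharding above with gradient averaging via mindspore.nn.DistributedGradReducer. It is a sketch under assumptions, not a confirmed mindnlp API: names such as model, loss_fn, optimizer and train_dataset stand in for the objects built in bert_classify.py, and the data pipeline is simplified to two columns.

import mindspore as ms
from mindspore import nn
from mindspore.communication import init, get_group_size

init()
ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True)

# Averages gradients across all cards each step so the weights stay identical on every worker.
grad_reducer = nn.DistributedGradReducer(optimizer.parameters, mean=True, degree=get_group_size())

def forward_fn(input_ids, labels):
    logits = model(input_ids)
    return loss_fn(logits, labels)

grad_fn = ms.value_and_grad(forward_fn, None, optimizer.parameters)

for input_ids, labels in train_dataset.create_tuple_iterator():
    loss, grads = grad_fn(input_ids, labels)
    grads = grad_reducer(grads)  # all-reduce + mean over the 8 workers
    optimizer(grads)

With gradients averaged every step, the per-card 1/8 shards train one consistent model, so the divergent epoch losses seen in the worker logs should disappear. Whether the same effect can be obtained purely through mindnlp Trainer configuration is exactly the question raised above.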
