Communication error when saving a model with ZeRO-3 #1013

@MaxwellDing

Description

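I am running SFT training of Wan2.1-T2V-14B with ZeRO-3 enabled and zero3_save_16bit_model: true set in accelerate.yaml. Saving the model hangs during communication. According to the logs, the hang happens because rank 0 issues the op allgather_base while the other ranks issue the op allreduce.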

Switching to ZeRO-2 works fine.

Is this a DeepSpeed bug, or does the DiffSynth model need to be adapted?
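
For reference, one common way to get this kind of op mismatch (not confirmed as the cause here) is a save path that only rank 0 enters. With ZeRO-3 and zero3_save_16bit_model enabled, consolidating the sharded weights is a collective step, so every rank has to call it, and only the file write should be limited to the main process. A minimal sketch of that pattern follows. The helper name save_consolidated_weights and its arguments are hypothetical; accelerator is assumed to be an accelerate.Accelerator, and the calls used are get_state_dict, is_main_process, save and wait_for_everyone.

```python
def save_consolidated_weights(accelerator, model, path):
    """Hypothetical helper: gather ZeRO-3 shards on every rank, write on rank 0 only."""
    # Collective step: under ZeRO-3 every rank must make this call, because
    # consolidating the sharded parameters needs all ranks to participate.
    state_dict = accelerator.get_state_dict(model)

    # Local step: only the main process writes the file to disk.
    if accelerator.is_main_process:
        accelerator.save(state_dict, path)

    # Barrier so no rank resumes training while the file is still being written.
    accelerator.wait_for_everyone()
```

If the training script instead wraps the whole save in an is_main_process check, rank 0 would start the gather while the other ranks move on to the next step's collective, which would look like the allgather_base versus allreduce mismatch in the logs. This is an assumption about the script, not something the logs above prove.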
