Distributed Environment Issue in Training #16

@YiboLi-4110

Description

Wonderful work you guys did!

But I have a small question:
In train.py, lines 343 to 360 show that the precision is selected according to the distributed training mode.
In accelerate_config_4_gpu.yaml, DeepSpeed is explicitly configured. However, train_4_gpu.sh sets MULTI_GPU through command-line parameters.
So I would like to know: in which distributed environment were the released weights actually trained?
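To make the concern concrete, here is a minimal sketch of mode-dependent precision selection in the style the question describes. This is a hypothetical illustration assuming typical Hugging Face Accelerate patterns, not the repository's actual train.py code; the function name and precision values are assumptions.

```python
# Hypothetical sketch (not the repo's actual code): precision chosen
# per distributed mode, as train.py lines 343-360 reportedly do.
def select_precision(distributed_type: str) -> str:
    # DeepSpeed runs often take precision from the DeepSpeed config,
    # while plain MULTI_GPU runs fall back to a script default.
    if distributed_type == "DEEPSPEED":
        return "bf16"  # e.g. read from the DeepSpeed config
    if distributed_type == "MULTI_GPU":
        return "fp16"  # e.g. a command-line default
    return "no"        # single-GPU / CPU fallback

print(select_precision("DEEPSPEED"))
print(select_precision("MULTI_GPU"))
```

If launching with `--multi_gpu` bypasses the DeepSpeed section of the yaml, the two launch paths could train under different precisions, which is why knowing the actual environment of the released weights matters.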

Because I found that, using the repository code with no modifications, the video quality generated by the two models I trained on 4 A800 GPUs was slightly worse than that of the weights you provided.
Specifically, my trained models perform somewhat poorly at distinguishing the magnitude of forces and at preserving the integrity of objects after forces are applied.

Do you think this is caused by training randomness? By the distributed-environment handling in the code? Or by something else?

Looking forward to your reply. Thanks!
