Distributed Environment Issue in Training #16

@YiboLi-4110

Description

Wonderful work you guys did!

But I have a small question:
In train.py, lines 343 to 360 show that the precision is selected according to the distributed training mode.
In accelerate_config_4_gpu.yaml, DeepSpeed is explicitly configured. However, train_4_gpu.sh sets MULTI_GPU through command-line parameters.
So I would like to know: in which distributed environment were the released weights actually trained?
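To make the concern concrete, here is a minimal sketch of mode-dependent precision selection in the style the question describes. This is a hypothetical illustration assuming typical Hugging Face Accelerate patterns, not the repository's actual train.py code; the function name and precision values are assumptions.

```python
# Hypothetical sketch (not the repo's actual code): precision chosen
# per distributed mode, as train.py lines 343-360 reportedly do.
def select_precision(distributed_type: str) -> str:
    # DeepSpeed runs often take precision from the DeepSpeed config,
    # while plain MULTI_GPU runs fall back to a script default.
    if distributed_type == "DEEPSPEED":
        return "bf16"  # e.g. read from the DeepSpeed config
    if distributed_type == "MULTI_GPU":
        return "fp16"  # e.g. a command-line default
    return "no"        # single-GPU / CPU fallback

print(select_precision("DEEPSPEED"))
print(select_precision("MULTI_GPU"))
```

If launching with `--multi_gpu` bypasses the DeepSpeed section of the yaml, the two launch paths could train under different precisions, which is why knowing the actual environment of the released weights matters.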

Because I found that, using the repository code with no modifications, the video quality generated by the two models I trained on 4 A800 GPUs was slightly worse than that of the weights you provided.
Specifically, my trained models perform somewhat poorly at distinguishing the magnitude of forces and at preserving the integrity of objects after forces are applied.

Do you think this is caused by training randomness? By the distributed-environment handling in the code? Or by something else?

Looking forward to your reply. Thanks!
