Using 4 GPUs for training takes the same time as using just 1 #202

@MiguelCosta94

Description

I'm training a BigGAN with differentiable augmentation (DiffAug) and LeCam regularization on a custom dataset. My setup has four NVIDIA RTX 3070 GPUs and runs Ubuntu 20.04. Training on all four GPUs with Distributed Data Parallel (DDP) takes the same time as training on a single GPU. Am I doing something wrong?

For training using a single GPU, I'm using the following command:
CUDA_VISIBLE_DEVICES=0 python3 src/main.py -t -hdf5 -l -std_stat -std_max 64 -std_step 64 -metrics fid is prdc -ref "train" -cfg src/configs/VWW/BigGAN-DiffAug-LeCam.yaml -data ../Datasets/vw_coco2014_96_GAN -save SAVE_PATH_VWW -mpc --post_resizer "friendly" --eval_backbone "InceptionV3_tf"

For training using the 4 GPUs, I'm using the following commands:
export MASTER_ADDR=localhost
export MASTER_PORT=1234
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 src/main.py -t -DDP -tn 1 -cn 0 -std_stat -std_max 64 -std_step 64 -metrics fid is prdc -ref "train" -cfg src/configs/VWW/BigGAN-DiffAug-LeCam.yaml -data ../Datasets/vw_coco2014_96_GAN -save SAVE_PATH_VWW -mpc --post_resizer "friendly" --eval_backbone "InceptionV3_tf"
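
To make the comparison between the two runs concrete, it may help to look at throughput (images per second) rather than time per step. The snippet below is only a minimal sketch and is not part of the repository: the helper `throughput`, its parameters, and the numbers passed to it are hypothetical placeholders, and it assumes the configured batch size is applied per GPU under DDP, which would need to be checked against the config.

```python
def throughput(batch_size_per_gpu: int, num_gpus: int, seconds_per_step: float) -> float:
    """Images processed per second across all GPUs.

    Hypothetical helper: assumes every GPU processes `batch_size_per_gpu`
    images in each training step (an assumption, not confirmed for this repo).
    """
    if seconds_per_step <= 0:
        raise ValueError("seconds_per_step must be positive")
    return batch_size_per_gpu * num_gpus / seconds_per_step


# Placeholder values for illustration only; these are not measurements.
single_gpu = throughput(batch_size_per_gpu=32, num_gpus=1, seconds_per_step=0.5)
four_gpu = throughput(batch_size_per_gpu=32, num_gpus=4, seconds_per_step=0.5)

print(f"1 GPU:  {single_gpu:.1f} img/s")
print(f"4 GPUs: {four_gpu:.1f} img/s")
print(f"ratio:  {four_gpu / single_gpu:.1f}x")
```

If the measured time per step is the same in both runs but the ratio above is close to the number of GPUs, the DDP run is seeing more images per step; if the ratio is close to 1, the extra GPUs are not contributing.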
