
cifar10 example is not scalable with multiple GPUs #75

@H4dr1en

Description


I slightly adapted the cifar10 example in this fork, basically removing python-fire and adding support for torch.distributed.launch, so that it can be executed as a standalone script with clearml-task.
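For reference, the entry point follows the standard ignite pattern, roughly like the sketch below (assumed structure and illustrative names/values, not the exact fork contents):

```python
# Minimal sketch of the standalone entry point (assumed structure, not the
# exact fork code). Launched with torch.distributed.launch, e.g.:
#   python -m torch.distributed.launch --nproc_per_node=4 --use_env cifar10.py
import ignite.distributed as idist


def training(local_rank, config):
    # the usual cifar10 training loop (model, loaders, trainer) goes here
    ...


if __name__ == "__main__":
    config = {"batch_size": 64, "num_workers": 8}  # illustrative values
    # idist.Parallel picks up the ranks/env set by torch.distributed.launch
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training, config)
```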

I executed the script with nproc_per_node in [1, 2, 3, 4] on an AWS g4dn.12xlarge instance (4x T4 GPUs). I got the following results:

  • batch size=16, nproc_per_node=1 => Runtime: 29:53
  • batch size=16, nproc_per_node=1 => Runtime: 05:34
    Here I disabled DataParallel, as mentioned in DataParallel is used by auto_model with single GPU pytorch/ignite#2447 (see the sketch after this list)
  • batch size=32, nproc_per_node=2 => Runtime: 17:11
  • batch size=48, nproc_per_node=3 => Runtime: 11:33
  • batch size=64, nproc_per_node=4 => Runtime: 08:47
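
Regarding the DataParallel note above: with several GPUs visible but a single process, idist.auto_model wraps the model in nn.DataParallel (pytorch/ignite#2447), which explains the 29:53 vs. 05:34 gap between the two single-process runs. A minimal sketch of the workaround, assuming ignite's idist helpers (not the exact fork code):

```python
import torch.nn as nn
import ignite.distributed as idist

model = nn.Linear(32, 10)  # stand-in for the actual cifar10 model

# With 4 GPUs visible but a single process, idist.auto_model wraps the model
# in nn.DataParallel (see pytorch/ignite#2447). Bypassing it in the
# single-process case avoids that slow path:
if idist.get_world_size() > 1:
    model = idist.auto_model(model)  # DistributedDataParallel per process
else:
    model = model.to(idist.device())  # plain single-GPU placement
```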

I am increasing the batch size by 16 each time I add a GPU, so that each GPU sees the same number of samples per iteration. I kept the default number of data-loading processes (8) for all runs, since I never observed the GPUs being under-used (utilization below 95%).
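
My understanding of the batch-size handling (an assumption about ignite 0.4.x behavior, not something specific to the fork) is that idist.auto_dataloader divides the configured batch size by the world size and injects a DistributedSampler, which is why scaling the total batch size keeps the per-GPU load constant:

```python
import ignite.distributed as idist
from torchvision import datasets, transforms

train_set = datasets.CIFAR10(
    "/tmp/cifar10", train=True, download=True, transform=transforms.ToTensor()
)

# As I understand ignite 0.4.x, auto_dataloader divides batch_size by the
# world size and adds a DistributedSampler, so batch_size=64 with
# nproc_per_node=4 yields 16 samples per GPU per iteration.
train_loader = idist.auto_dataloader(
    train_set, batch_size=64, num_workers=8, shuffle=True
)
```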

[Figure: GPU utilization as reported by ClearML]

I was expecting a quasi-linear speedup as GPUs were added, but that isn't the case. Am I missing something?
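
To make the gap concrete, here is the arithmetic on the runtimes above, taking the 1-GPU run without DataParallel as the baseline:

```python
# Observed runtimes (seconds), taken from the list above
runtimes = {1: 5 * 60 + 34, 2: 17 * 60 + 11, 3: 11 * 60 + 33, 4: 8 * 60 + 47}

baseline = runtimes[1]
for n, t in runtimes.items():
    print(f"{n} GPU(s): {t:4d}s -> speedup {baseline / t:.2f}x (ideal: {n}x)")
```

That is, every multi-GPU run is actually slower than the single-GPU baseline (e.g. 0.63x with 4 GPUs), not just sub-linear.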

PS: Here are the requirements I used to execute the script

torch==1.7.1+cu110
torchvision==0.8.2
pytorch-ignite==0.4.8
clearml==1.1.6
tensorboardX==2.4.1
