Description
I slightly adapted the cifar10 example in this fork, essentially removing python-fire and switching to torch.distributed.launch, so that it can be executed as a standalone script with clearml-task.
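For context, the adapted entry point looks roughly like the sketch below (simplified and illustrative only; the argument names and the exact idist usage are not copied from the fork):

```python
# Minimal sketch of a cifar10-style script launched with torch.distributed.launch.
# The --batch_size flag and the placeholder training body are assumptions.
import argparse

import ignite.distributed as idist


def training(local_rank, config):
    # In the real script: build the dataloader/model/optimizer with the
    # idist.auto_* helpers and run the ignite Engine; only a placeholder here.
    print(f"rank={idist.get_rank()} world_size={idist.get_world_size()} "
          f"batch_size={config['batch_size']}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=16)
    args = parser.parse_args()

    # With backend="nccl", idist.Parallel attaches to the process group created by
    # torch.distributed.launch (run with --use_env) and calls `training` once per process.
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training, vars(args))
```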
I executed the adapted script with nproc_per_node in [1, 2, 3, 4] on an AWS g4dn.12xlarge instance (4x T4 GPUs) and got the following results:
- batch size=16, nproc_per_node=1 => Runtime: 29:53
- batch size=16, nproc_per_node=1 => Runtime: 05:34 (same run with DataParallel disabled, as mentioned in "DataParallel is used by auto_model with single GPU" pytorch/ignite#2447)
- batch size=32, nproc_per_node=2 => Runtime: 17:11
- batch size=48, nproc_per_node=3 => Runtime: 11:33
- batch size=64, nproc_per_node=4 => Runtime: 08:47
I am increasing the batch size by 16 each time I add a GPU, so that each GPU sees the same number of samples per iteration. I kept the default number of processes (8) for all runs, because I never observed the GPUs being under-used (utilization below 95%).
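The runs were launched along these lines (a sketch only; the script name `cifar10_ddp.py` and the `--batch_size` flag are placeholders, not the actual names in the fork):

```python
# Sketch of the run matrix: one torch.distributed.launch invocation per GPU count,
# with the total batch size growing by 16 per GPU so each GPU sees 16 samples per step.
import subprocess
import sys

RUNS = [(1, 16), (2, 32), (3, 48), (4, 64)]  # (nproc_per_node, total batch size)

for nproc, batch_size in RUNS:
    cmd = [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc}",
        "--use_env",                 # pass rank info via env vars instead of --local_rank
        "cifar10_ddp.py",            # hypothetical name of the adapted script
        f"--batch_size={batch_size}",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```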
I was expecting a quasi-linear improvement in runtime as GPUs are added, but that isn't the case. Am I missing something?
PS: Here are the requirements I used to execute the script:
torch==1.7.1+cu110
torchvision==0.8.2
pytorch-ignite==0.4.8
clearml==1.1.6
tensorboardX==2.4.1