Description
I slightly adapted the cifar10 example in this fork, essentially removing python-fire and switching to torch.distributed.launch, so that it can be executed as a standalone script with clearml-task.
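For context, the adapted entry point looks roughly like the sketch below (simplified and illustrative only; the argument names and the exact idist usage are not copied from the fork):

```python
# Minimal sketch of a cifar10-style script launched with torch.distributed.launch.
# The --batch_size flag and the placeholder training body are assumptions.
import argparse

import ignite.distributed as idist


def training(local_rank, config):
    # In the real script: build the dataloader/model/optimizer with the
    # idist.auto_* helpers and run the ignite Engine; only a placeholder here.
    print(f"rank={idist.get_rank()} world_size={idist.get_world_size()} "
          f"batch_size={config['batch_size']}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=16)
    args = parser.parse_args()

    # With backend="nccl", idist.Parallel attaches to the process group created by
    # torch.distributed.launch (run with --use_env) and calls `training` once per process.
    with idist.Parallel(backend="nccl") as parallel:
        parallel.run(training, vars(args))
```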
I executed the adapted script with nproc_per_node in [1, 2, 3, 4] on an AWS g4dn.12xlarge instance (4x T4 GPUs) and got the following results:
- batch size=16, nproc_per_node=1 => Runtime: 29:53
- batch size=16, nproc_per_node=1 => Runtime: 05:34 (same run with DataParallel disabled, as mentioned in "DataParallel is used by auto_model with single GPU" pytorch/ignite#2447)
- batch size=32, nproc_per_node=2 => Runtime: 17:11
- batch size=48, nproc_per_node=3 => Runtime: 11:33
- batch size=64, nproc_per_node=4 => Runtime: 08:47
I am increasing the batch size by 16 each time I add a GPU, so that each GPU sees the same number of samples per iteration. I kept the default number of processes (8) for all runs, because I never observed the GPUs being under-used (utilization below 95%).
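The runs were launched along these lines (a sketch only; the script name `cifar10_ddp.py` and the `--batch_size` flag are placeholders, not the actual names in the fork):

```python
# Sketch of the run matrix: one torch.distributed.launch invocation per GPU count,
# with the total batch size growing by 16 per GPU so each GPU sees 16 samples per step.
import subprocess
import sys

RUNS = [(1, 16), (2, 32), (3, 48), (4, 64)]  # (nproc_per_node, total batch size)

for nproc, batch_size in RUNS:
    cmd = [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc}",
        "--use_env",                 # pass rank info via env vars instead of --local_rank
        "cifar10_ddp.py",            # hypothetical name of the adapted script
        f"--batch_size={batch_size}",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```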
I was expecting a quasi-linear improvement in runtime as GPUs are added, but that isn't the case. Am I missing something?
PS: Here are the requirements I used to execute the script:
torch==1.7.1+cu110
torchvision==0.8.2
pytorch-ignite==0.4.8
clearml==1.1.6
tensorboardX==2.4.1