ResNet-a2
#1409
-
Hi @rwightman,
I'm currently reproducing ResNet50 with the A2 procedure ("ResNet strikes back: An improved training procedure in timm"). Can the following command faithfully reproduce A2 training for ResNet50?
I trained on 4 Tesla V100s:
./distributed_train.sh 4 imagenet/ --model resnet50 --aa rand-m7-mstd0.5-inc1 --mixup .1 --cutmix 1.0 --aug-repeats 3 --remode pixel --reprob 0.0 --crop-pct 0.95 --drop-path .05 --smoothing 0.0 --bce-loss --bce-target-thresh 0.2 --opt lamb --weight-decay .02 --sched cosine --epochs 300 --warmup-epochs 5 --lr 5e-3 --warmup-lr 1e-4 -b 512 -j 16 --amp --channels-last --seed 42
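As a quick cross-check of the launch parameters (a minimal sketch; per the paper, the A2 recipe was tuned for a global batch of 2048 with LAMB at lr 5e-3 over 300 epochs, which this command matches):

```python
# Sanity check on the effective (global) batch size implied by the command.
# Assumption: A2 in "ResNet strikes back" targets global batch 2048,
# LAMB optimizer, lr 5e-3, 300 epochs.
num_gpus = 4            # ./distributed_train.sh 4 ...
per_gpu_batch = 512     # -b 512
global_batch = num_gpus * per_gpu_batch
assert global_batch == 2048, f"global batch {global_batch} != 2048 expected by A2"
print(f"global batch {global_batch}, lr 5e-3 (LAMB), 300 epochs -- consistent with A2")
```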
-
When I run this command on a DGX server, GPU utilization is below 100% most of the time; the GPUs sit idle waiting on the CPU to load data, so training is much slower than on my desktop (2x 1080 Ti). Does anybody know why? @rwightman
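One way to confirm the input pipeline is the bottleneck is to measure how fast the DataLoader alone can deliver batches and compare that against the GPUs' training throughput (a minimal sketch, assuming a standard PyTorch DataLoader; `loader_throughput` is a hypothetical helper, not part of timm):

```python
import time
from torch.utils.data import DataLoader

def loader_throughput(loader: DataLoader, num_batches: int = 100) -> float:
    """Images/sec the DataLoader alone can deliver, with the GPUs out of the picture."""
    it = iter(loader)
    next(it)  # warm up worker processes and filesystem caches
    start = time.perf_counter()
    n = 0
    for _ in range(num_batches):
        images, _ = next(it)
        n += images.size(0)
    return n / (time.perf_counter() - start)
```

If the result comes out below the images/sec the GPUs sustain during training, the CPU/storage side is the limit; raising `-j`, moving the dataset to fast local storage, or pre-resizing images are the usual remedies.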
-
FYI, the best cloud setup I've found for training is Lambda Labs GPU Cloud; their 4-GPU A100 or A6000 instances have a decent number of CPUs and fast local SSDs, which is good enough for a standard ImageNet (files-and-folders) dataset.