Training extremely slow #11

Hello,

I followed the README closely and launched a training run with the following command on a server with 8 V100 GPUs:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python train_imagenet.py --config-file rn50_configs/rn50_88_epochs.yaml \
    --data.train_dataset=$HOME/data/imagenet_ffcv/train_500_0.50_90.ffcv \
    --data.val_dataset=$HOME/data/imagenet_ffcv/val_500_0.50_90.ffcv \
    --data.num_workers=3 --data.in_memory=1 \
    --logging.folder=$HOME/experiments/ffcv/rn50_88_epochs
```

Training takes almost an hour per epoch (roughly 47 minutes for the first epoch, going by `relative_time`), and the second epoch is almost as slow as the first. The log file contains the following:

```bash
cat ~/experiments/ffcv/rn50_88_epochs/d9ef0d7f-17a3-4e57-8d93-5e7c9a110d66/log 
{"timestamp": 1650641704.0822473, "relative_time": 2853.3256430625916, "current_lr": 0.8473609134615385, "top_1": 0.07225999981164932, "top_5": 0.19789999723434448, "val_time": 103.72948884963989, "train_loss": null, "epoch": 0}
{"timestamp": 1650644358.3394542, "relative_time": 5507.582849979401, "current_lr": 1.6972759134615385, "top_1": 0.16143999993801117, "top_5": 0.3677400052547455, "val_time": 92.9171462059021, "train_loss": null, "epoch": 1}
```
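For reference, the per-epoch durations can be derived from these log entries. This is a minimal sketch, assuming `relative_time` is cumulative wall-clock time in seconds since the start of the run (my reading of the log, not something I have confirmed in the code). The helper name `epoch_durations` is made up for illustration, and the log lines are trimmed to the fields used here.

```python
import json


def epoch_durations(log_lines):
    """Return (epoch, seconds) pairs from JSON-lines log entries.

    Assumes `relative_time` is cumulative seconds since the run started,
    so each epoch's duration is the difference from the previous entry.
    """
    durations = []
    previous = 0.0
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        durations.append((entry["epoch"], entry["relative_time"] - previous))
        previous = entry["relative_time"]
    return durations


# The two entries from the log above, reduced to the relevant fields.
log_lines = [
    '{"relative_time": 2853.3256430625916, "val_time": 103.72948884963989, "epoch": 0}',
    '{"relative_time": 5507.582849979401, "val_time": 92.9171462059021, "epoch": 1}',
]

for epoch, seconds in epoch_durations(log_lines):
    print(f"epoch {epoch}: {seconds / 60:.1f} min")
# epoch 0: 47.6 min
# epoch 1: 44.2 min
```

Under that assumption, the first figure also includes any startup time before training began, so the second epoch is the more reliable number.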

Is there anything I should check?

Thank you in advance for your response.
