Training extremely slow #11

@netw0rkf10w

Hello,

I followed the README closely and launched a training run with the following command on a server with 8 V100 GPUs:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python train_imagenet.py --config-file rn50_configs/rn50_88_epochs.yaml \
    --data.train_dataset=$HOME/data/imagenet_ffcv/train_500_0.50_90.ffcv \
    --data.val_dataset=$HOME/data/imagenet_ffcv/val_500_0.50_90.ffcv \
    --data.num_workers=3 --data.in_memory=1 \
    --logging.folder=$HOME/experiments/ffcv/rn50_88_epochs

Training takes almost an hour per epoch (roughly 44 to 48 minutes, judging by the relative_time field in the log), and the second epoch is almost as slow as the first. The log file contains the following:

cat ~/experiments/ffcv/rn50_88_epochs/d9ef0d7f-17a3-4e57-8d93-5e7c9a110d66/log 
{"timestamp": 1650641704.0822473, "relative_time": 2853.3256430625916, "current_lr": 0.8473609134615385, "top_1": 0.07225999981164932, "top_5": 0.19789999723434448, "val_time": 103.72948884963989, "train_loss": null, "epoch": 0}
{"timestamp": 1650644358.3394542, "relative_time": 5507.582849979401, "current_lr": 1.6972759134615385, "top_1": 0.16143999993801117, "top_5": 0.3677400052547455, "val_time": 92.9171462059021, "train_loss": null, "epoch": 1}
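
For reference, here is a minimal sketch that turns the two log lines above into per-epoch durations. It assumes that relative_time is the number of seconds elapsed since the start of the run, so the first entry also includes any startup cost, and each entry probably includes the validation pass as well. The helper name epoch_durations is hypothetical and is not part of the training script.

```python
import json

# The two log lines quoted above, copied verbatim.
LOG_LINES = [
    '{"timestamp": 1650641704.0822473, "relative_time": 2853.3256430625916, '
    '"current_lr": 0.8473609134615385, "top_1": 0.07225999981164932, '
    '"top_5": 0.19789999723434448, "val_time": 103.72948884963989, '
    '"train_loss": null, "epoch": 0}',
    '{"timestamp": 1650644358.3394542, "relative_time": 5507.582849979401, '
    '"current_lr": 1.6972759134615385, "top_1": 0.16143999993801117, '
    '"top_5": 0.3677400052547455, "val_time": 92.9171462059021, '
    '"train_loss": null, "epoch": 1}',
]


def epoch_durations(lines):
    """Return (epoch, seconds) pairs.

    Assumption: relative_time is cumulative wall-clock seconds, so the
    duration of an epoch is its growth since the previous log entry.
    """
    durations = []
    previous = 0.0
    for line in lines:
        record = json.loads(line)
        durations.append((record["epoch"], record["relative_time"] - previous))
        previous = record["relative_time"]
    return durations


for epoch, seconds in epoch_durations(LOG_LINES):
    print(f"epoch {epoch}: {seconds:.0f} s ({seconds / 60:.1f} min)")
# epoch 0: 2853 s (47.6 min)
# epoch 1: 2654 s (44.2 min)
```

Under that assumption, the second epoch is only about 200 s faster than the first, which matches the observation that it is almost as slow.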

Is there anything I should check?

Thank you in advance for your response.
