Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to xxxx #19

@AlphaNext

Start command:

imagenetpath=mypath
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345  moby_main.py \
       --cfg configs/moby_swin_tiny.yaml --data-path ${imagenetpath} --batch-size 256

But during training I get the message "Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to xxxx":

[2023-10-24 17:33:21 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][290/625]  eta 0:05:52 lr 0.002772 time 0.5567 (1.0516)  loss 10.5960 (10.9174)  grad_norm 1.4802 (1.5236)  mem 45716MB
[2023-10-24 17:33:38 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][300/625]  eta 0:05:47 lr 0.002785 time 0.7607 (1.0707)  loss 10.7823 (10.9141)  grad_norm 2.3465 (1.5536)  mem 45716MB
[2023-10-24 17:33:45 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][310/625]  eta 0:05:33 lr 0.002797 time 0.9247 (1.0588)  loss 10.9386 (10.9140)  grad_norm 3.8597 (1.6136)  mem 45716MB
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 65536.0
[2023-10-24 17:33:53 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][320/625]  eta 0:05:20 lr 0.002810 time 0.5590 (1.0518)  loss 11.4219 (10.9264)  grad_norm 3.9233 (inf)  mem 45716MB
[2023-10-24 17:34:00 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][330/625]  eta 0:05:07 lr 0.002823 time 0.5751 (1.0412)  loss 11.6204 (10.9487)  grad_norm 2.7699 (inf)  mem 45716MB
[2023-10-24 17:34:09 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][340/625]  eta 0:04:55 lr 0.002836 time 0.5561 (1.0365)  loss 11.2880 (10.9609)  grad_norm 2.3273 (inf)  mem 45716MB
[2023-10-24 17:34:16 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][350/625]  eta 0:04:42 lr 0.002849 time 0.5530 (1.0271)  loss 11.0601 (10.9651)  grad_norm 0.9230 (inf)  mem 45716MB
[2023-10-24 17:34:23 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][360/625]  eta 0:04:30 lr 0.002861 time 0.5628 (1.0200)  loss 10.9609 (10.9669)  grad_norm 0.8707 (inf)  mem 45716MB
[2023-10-24 17:34:30 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][370/625]  eta 0:04:17 lr 0.002874 time 0.5648 (1.0094)  loss 10.9728 (10.9655)  grad_norm 1.9388 (inf)  mem 45716MB
[2023-10-24 17:34:36 moby__swin_tiny__patch4_window7_224__odpr02_tdpr0_cm099_ct02_queue4096_proj2_pred2](moby_main.py 177): INFO Train: [3/300][380/625]  eta 0:04:04 lr 0.002887 time 0.5568 (0.9993)  loss 10.8801 (10.9645)  grad_norm 0.6718 (inf)  mem 45716MB
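
For anyone reading the log above, here is a minimal sketch of what I think is going on. The message format looks like the one printed by a dynamic loss scaler in mixed-precision training (probably NVIDIA Apex AMP, but that is my assumption, not something confirmed in this thread). The `DynamicLossScaler`, `AverageMeter`, and `simulate` names below are hypothetical and written only for illustration, and the constants (initial scale 2^16, factor 2, window of 2000 clean steps) are assumed defaults. The sketch shows two things: a single non-finite gradient causes a skipped step and a halved scale, and one `inf` sample is enough to make a running average print `(inf)` for the rest of the epoch, even when the per-step `grad_norm` values are finite again.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler (hypothetical): halve on overflow, grow after clean steps."""

    def __init__(self, init_scale=2.0 ** 16, factor=2.0, window=2000):
        self.scale = init_scale      # assumed initial loss scale
        self.factor = factor         # assumed grow/shrink factor
        self.window = window         # assumed number of clean steps before growing
        self.clean_steps = 0

    def update(self, grad_norm):
        """Return True if the optimizer step should run, False if it is skipped."""
        if not math.isfinite(grad_norm):
            # Overflow: shrink the scale and skip this optimizer step.
            self.scale /= self.factor
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps == self.window:
            # Enough consecutive clean steps: try a larger scale again.
            self.scale *= self.factor
            self.clean_steps = 0
        return True


class AverageMeter:
    """Running mean, like the value shown in parentheses in the training log."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, val):
        self.sum += val
        self.count += 1

    @property
    def avg(self):
        return self.sum / self.count


def simulate(grad_norms, window=2000):
    """Feed a sequence of gradient norms through the scaler and the meter."""
    scaler = DynamicLossScaler(window=window)
    meter = AverageMeter()
    skipped = 0
    for g in grad_norms:
        if not scaler.update(g):
            skipped += 1
        meter.update(g)  # assumption: the norm is recorded even on skipped steps
    return scaler.scale, skipped, meter.avg


if __name__ == "__main__":
    # A tiny window keeps the example short; one inf step poisons the average.
    print(simulate([1.0, 2.0, float("inf"), 1.0], window=2))
```

If this matches what the training script does, an occasional skipped step is the scaler working as designed rather than a crash, and the `(inf)` average is a logging artifact of one overflowed step. Whether it is harmless here depends on how often the message repeats and whether the loss keeps going down afterwards.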
