
The loss becomes 'nan' as early as the second epoch during fine-tuning #3

@Jeba-create

I used the 'swin' checkpoint and attempted to fine-tune the model using the command below.
CUDA_VISIBLE_DEVICES=0 python train.py --cfg configs/cuhk_sysu.yaml --resume --ckpt swin_tiny_cnvrtd.pth OUTPUT_DIR './results' SOLVER.BASE_LR 0.00003 EVAL_PERIOD 5 MODEL.BONE 'swin_tiny' INPUT.BATCH_SIZE_TRAIN 4 MODEL.SEMANTIC_WEIGHT 0.8
However, during the second epoch the loss became 'nan', as shown below:

Epoch: [1]  [ 920/2801]  eta: 0:27:47  lr: 0.000300  loss: 8.7676 (8.7104)  loss_proposal_cls: 0.2417 (0.2408)  loss_proposal_reg: 2.5734 (2.3512)  loss_box_cls: 0.6964 (0.7277)  loss_box_reg: 0.2329 (0.2421)  loss_box_reid: 4.3837 (4.5777)  loss_rpn_reg: 0.0509 (0.0585)  loss_rpn_cls: 0.4375 (0.5125)  time: 0.8908  data: 0.0003  max mem: 25115
Loss is nan, stopping training
{'loss_proposal_cls': tensor(0.1927, device='cuda:0', grad_fn=<MulBackward0>), 'loss_proposal_reg': tensor(2.3873, device='cuda:0', grad_fn=<MulBackward0>), 'loss_box_cls': tensor(0.6746, device='cuda:0', grad_fn=<MulBackward0>), 'loss_box_reg': tensor(0.2970, device='cuda:0', grad_fn=<MulBackward0>), 'loss_box_reid': tensor(nan, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rpn_reg': tensor(0.1049, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rpn_cls': tensor(0.4566, device='cuda:0', grad_fn=<MulBackward0>)}
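
For reference, here is a minimal sketch of how I would try to localize and guard against the NaN (this is not code from this repo; `safe_step`, its arguments, and the clipping value are illustrative and assume a standard PyTorch loop where the model returns a dict of loss tensors like the one printed above):

```python
import torch

# Illustrative only: anomaly detection makes backward() report the op that
# produced the NaN (the log above points at loss_box_reid).
torch.autograd.set_detect_anomaly(True)

def safe_step(loss_dict, optimizer, model, max_norm=10.0):
    """Hypothetical helper: skip the update when any loss term is non-finite."""
    total = sum(loss_dict.values())
    if not torch.isfinite(total):
        bad = {k: float(v) for k, v in loss_dict.items() if not torch.isfinite(v)}
        print(f"skipping step, non-finite losses: {bad}")
        optimizer.zero_grad()
        return
    optimizer.zero_grad()
    total.backward()
    # Clipping gradients sometimes keeps the re-ID loss from diverging.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```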

In the meantime, could you please provide the trained checkpoint so that I can run inference with it?
