I started from the 'swin' checkpoint and tried to fine-tune the model with the following command:
CUDA_VISIBLE_DEVICES=0 python train.py --cfg configs/cuhk_sysu.yaml --resume --ckpt swin_tiny_cnvrtd.pth OUTPUT_DIR './results' SOLVER.BASE_LR 0.00003 EVAL_PERIOD 5 MODEL.BONE 'swin_tiny' INPUT.BATCH_SIZE_TRAIN 4 MODEL.SEMANTIC_WEIGHT 0.8
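However, during the second epoch (logged as Epoch [1]), training stopped because the loss became NaN. The loss dict printed at the end shows that loss_box_reid is the term that turned into NaN: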
Epoch: [1] [ 920/2801] eta: 0:27:47 lr: 0.000300 loss: 8.7676 (8.7104) loss_proposal_cls: 0.2417 (0.2408) loss_proposal_reg: 2.5734 (2.3512) loss_box_cls: 0.6964 (0.7277) loss_box_reg: 0.2329 (0.2421) loss_box_reid: 4.3837 (4.5777) loss_rpn_reg: 0.0509 (0.0585) loss_rpn_cls: 0.4375 (0.5125) time: 0.8908 data: 0.0003 max mem: 25115
Loss is nan, stopping training
{'loss_proposal_cls': tensor(0.1927, device='cuda:0', grad_fn=<MulBackward0>), 'loss_proposal_reg': tensor(2.3873, device='cuda:0', grad_fn=<MulBackward0>), 'loss_box_cls': tensor(0.6746, device='cuda:0', grad_fn=<MulBackward0>), 'loss_box_reg': tensor(0.2970, device='cuda:0', grad_fn=<MulBackward0>), 'loss_box_reid': tensor(nan, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rpn_reg': tensor(0.1049, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rpn_cls': tensor(0.4566, device='cuda:0', grad_fn=<MulBackward0>)}
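For reference, here is a minimal sketch of how the offending term can be picked out of a loss dict like the one above. It is not code from the repository: `find_nonfinite_losses` is a hypothetical helper, and the values are copied by hand from the printed dict as plain floats (the `float()` call should also accept single-element tensors, but that path is not exercised here).

```python
import math


def find_nonfinite_losses(loss_dict):
    """Return the names of loss terms whose value is NaN or infinite.

    Hypothetical helper for illustration; not part of the training code.
    """
    return [
        name
        for name, value in loss_dict.items()
        if not math.isfinite(float(value))
    ]


# Values copied from the dict printed above, with the tensor wrappers dropped.
losses = {
    "loss_proposal_cls": 0.1927,
    "loss_proposal_reg": 2.3873,
    "loss_box_cls": 0.6746,
    "loss_box_reg": 0.2970,
    "loss_box_reid": float("nan"),
    "loss_rpn_reg": 0.1049,
    "loss_rpn_cls": 0.4566,
}

bad_terms = find_nonfinite_losses(losses)
print(bad_terms)
```

Running a check like this on every iteration, before the losses are summed, would show the first step at which the re-id loss stops being finite.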
Could you please provide the trained checkpoint so that I can run inference with it?