
The loss becomes 'nan' as early as the second epoch during fine-tuning #3

@Jeba-create

I used the 'swin' checkpoint and attempted to fine-tune the model using the command below.
CUDA_VISIBLE_DEVICES=0 python train.py --cfg configs/cuhk_sysu.yaml --resume --ckpt swin_tiny_cnvrtd.pth OUTPUT_DIR './results' SOLVER.BASE_LR 0.00003 EVAL_PERIOD 5 MODEL.BONE 'swin_tiny' INPUT.BATCH_SIZE_TRAIN 4 MODEL.SEMANTIC_WEIGHT 0.8
However, during the second epoch the loss became 'nan', as shown below:

Epoch: [1]  [ 920/2801]  eta: 0:27:47  lr: 0.000300  loss: 8.7676 (8.7104)  loss_proposal_cls: 0.2417 (0.2408)  loss_proposal_reg: 2.5734 (2.3512)  loss_box_cls: 0.6964 (0.7277)  loss_box_reg: 0.2329 (0.2421)  loss_box_reid: 4.3837 (4.5777)  loss_rpn_reg: 0.0509 (0.0585)  loss_rpn_cls: 0.4375 (0.5125)  time: 0.8908  data: 0.0003  max mem: 25115
Loss is nan, stopping training
{'loss_proposal_cls': tensor(0.1927, device='cuda:0', grad_fn=<MulBackward0>), 'loss_proposal_reg': tensor(2.3873, device='cuda:0', grad_fn=<MulBackward0>), 'loss_box_cls': tensor(0.6746, device='cuda:0', grad_fn=<MulBackward0>), 'loss_box_reg': tensor(0.2970, device='cuda:0', grad_fn=<MulBackward0>), 'loss_box_reid': tensor(nan, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rpn_reg': tensor(0.1049, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rpn_cls': tensor(0.4566, device='cuda:0', grad_fn=<MulBackward0>)}
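
For reference, here is a minimal sketch of how I would try to localize and guard against the NaN (this is not code from this repo; `safe_step`, its arguments, and the clipping value are illustrative and assume a standard PyTorch loop where the model returns a dict of loss tensors like the one printed above):

```python
import torch

# Illustrative only: anomaly detection makes backward() report the op that
# produced the NaN (the log above points at loss_box_reid).
torch.autograd.set_detect_anomaly(True)

def safe_step(loss_dict, optimizer, model, max_norm=10.0):
    """Hypothetical helper: skip the update when any loss term is non-finite."""
    total = sum(loss_dict.values())
    if not torch.isfinite(total):
        bad = {k: float(v) for k, v in loss_dict.items() if not torch.isfinite(v)}
        print(f"skipping step, non-finite losses: {bad}")
        optimizer.zero_grad()
        return
    optimizer.zero_grad()
    total.backward()
    # Clipping gradients sometimes keeps the re-ID loss from diverging.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```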

In the meantime, could you please provide the trained checkpoint so that I can run inference with it?
