Loss is NAN Training Stopped. #18

@A-n-o-r-a-k

Working my way through the code, I ran into a breaking error when running the following training loop:

# Train the model for three epochs
for epoch in range(num_epochs):
    # train for one epoch, printing every iteration
    train_one_epoch(model, optimizer, data_loader_train, device, epoch, print_freq=10)
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the test dataset
    evaluate(model, data_loader_val, device=device)

    checkpoint_path = f'trained_model_{epoch+1}_epochs.pth'
    torch.save(model.state_dict(), checkpoint_path)

It produces the following error:

Epoch: [0] [ 0/65] eta: 0:01:55 lr: 0.000125 loss: 2.9749 (2.9749) loss_classifier: 1.1639 (1.1639) loss_box_reg: 0.0148 (0.0148) loss_objectness: 1.5950 (1.5950) loss_rpn_box_reg: 0.2011 (0.2011) time: 1.7780 data: 0.4853 max mem: 5038
Loss is nan, stopping training
{'loss_classifier': tensor(1.3285, device='cuda:0', grad_fn=), 'loss_box_reg': tensor(0.0082, device='cuda:0', grad_fn=), 'loss_objectness': tensor(nan, device='cuda:0', grad_fn=), 'loss_rpn_box_reg': tensor(0.1605, device='cuda:0', grad_fn=)}

An exception has occurred, use %tb to see the full traceback.

SystemExit: 1

/home/q/anaconda3/envs/xview3/lib/python3.9/site-packages/IPython/core/interactiveshell.py:3556: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
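In the log above, `loss_objectness` is the only term reported as `nan`. A minimal diagnostic sketch follows, assuming it would help to (a) pick out which loss terms are non-finite and (b) scan the training targets for malformed boxes. Both helper names are hypothetical and are not part of the repository or of torchvision. The assumption that degenerate `[x1, y1, x2, y2]` boxes are a possible cause is only a common suspect for this symptom; nothing in this issue confirms it.

```python
import math


def non_finite_terms(loss_dict):
    """Return the names of loss terms whose value is NaN or infinite.

    Hypothetical helper: accepts plain floats or 0-d tensors,
    since float() works on both.
    """
    return [
        name
        for name, value in loss_dict.items()
        if not math.isfinite(float(value))
    ]


def degenerate_boxes(boxes):
    """Return indices of boxes that are non-finite or have no area.

    Hypothetical helper: assumes each box is [x1, y1, x2, y2],
    so a valid box needs x2 > x1 and y2 > y1.
    """
    bad = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        coords = (float(x1), float(y1), float(x2), float(y2))
        finite = all(math.isfinite(c) for c in coords)
        if not finite or coords[2] <= coords[0] or coords[3] <= coords[1]:
            bad.append(i)
    return bad


# Values taken from the log above, written as plain floats.
losses = {
    "loss_classifier": 1.3285,
    "loss_box_reg": 0.0082,
    "loss_objectness": float("nan"),
    "loss_rpn_box_reg": 0.1605,
}
print(non_finite_terms(losses))

# Made-up boxes for illustration only: the second has zero width.
print(degenerate_boxes([[0, 0, 10, 10], [5, 5, 5, 9]]))
```

If the box check comes back empty for every sample in `data_loader_train`, the targets are probably not the problem, and the learning rate schedule would be the next thing to look at.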
