Loss is NaN, training stopped (#18)

While working my way through the code, I ran into a breaking error when running this training loop:

```
# Train the model for three epochs
for epoch in range(num_epochs):
    # train for one epoch, printing every 10 iterations
    train_one_epoch(model, optimizer, data_loader_train, device, epoch, print_freq=10)
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the test dataset
    evaluate(model, data_loader_val, device=device)

    checkpoint_path = f'trained_model_{epoch+1}_epochs.pth'
    torch.save(model.state_dict(), checkpoint_path)
```
The first iteration logs normally, then `loss_objectness` becomes NaN and training exits:

Epoch: [0]  [ 0/65]  eta: 0:01:55  lr: 0.000125  loss: 2.9749 (2.9749)  loss_classifier: 1.1639 (1.1639)  loss_box_reg: 0.0148 (0.0148)  loss_objectness: 1.5950 (1.5950)  loss_rpn_box_reg: 0.2011 (0.2011)  time: 1.7780  data: 0.4853  max mem: 5038
Loss is nan, stopping training
{'loss_classifier': tensor(1.3285, device='cuda:0', grad_fn=<NllLossBackward0>), 'loss_box_reg': tensor(0.0082, device='cuda:0', grad_fn=<DivBackward0>), 'loss_objectness': tensor(nan, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), 'loss_rpn_box_reg': tensor(0.1605, device='cuda:0', grad_fn=<DivBackward0>)}

An exception has occurred, use %tb to see the full traceback.

SystemExit: 1


/home/q/anaconda3/envs/xview3/lib/python3.9/site-packages/IPython/core/interactiveshell.py:3556: UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
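
Since `loss_objectness` is the only NaN term in the loss dict above, one thing worth ruling out is bad input data. The following is a minimal sketch of such a check, not a confirmed fix. It assumes that `data_loader_train` yields `(images, targets)` pairs and that each target is a dict with a `"boxes"` entry in `[x1, y1, x2, y2]` format. `find_bad_boxes` and `scan_loader` are hypothetical helper names, not part of this repository.

```python
import math


def find_bad_boxes(boxes):
    """Return (index, reason) pairs for boxes that are non-finite or degenerate.

    `boxes` is any iterable of [x1, y1, x2, y2] sequences: nested lists, or a
    tensor of shape [N, 4], since float() also works on 0-d tensors.
    """
    problems = []
    for i, box in enumerate(boxes):
        x1, y1, x2, y2 = (float(v) for v in box)
        if not all(math.isfinite(v) for v in (x1, y1, x2, y2)):
            problems.append((i, "non-finite coordinate"))
        elif x2 <= x1 or y2 <= y1:
            problems.append((i, "zero or negative width/height"))
    return problems


def scan_loader(data_loader):
    """Walk a detection-style loader and collect suspicious boxes.

    Assumption: each batch is (images, targets) and each target has "boxes".
    Returns (batch index, sample index, box index, reason) tuples.
    """
    report = []
    for batch_idx, (images, targets) in enumerate(data_loader):
        for sample_idx, target in enumerate(targets):
            for box_idx, reason in find_bad_boxes(target["boxes"]):
                report.append((batch_idx, sample_idx, box_idx, reason))
    return report
```

Calling `scan_loader(data_loader_train)` before the training loop would list any flagged boxes. An empty list only rules out this one cause; NaN pixel values in the images or an unstable learning rate would need separate checks.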


