Summary
When the total loss becomes NaN, we can throw an error to stop the training. Otherwise, time is wasted training a model whose parameters have become NaN, as seen in deepmodeling/dpgen#1460.
Detailed Description
- NaN can be checked when the total loss is already on the CPU (rather than on the GPU), so the check adds no extra device-to-host transfer cost. For example, when writing to lcurve.out, the loss values are already on the CPU.
- Perform the check before a checkpoint is written, so that no checkpoint containing NaN parameters is saved.
- Implement the feature for TensorFlow, PyTorch, and PaddlePaddle backends.
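The check could look roughly like the sketch below. This is a minimal illustration, not the actual deepmd-kit API: the function name, error type, and call sites are assumptions. The key point is that it operates on a plain Python float (the value already fetched to the CPU for lcurve.out), so no extra GPU synchronization is introduced, and it runs before the checkpoint is saved.

```python
import math


def check_total_loss(total_loss: float, step: int) -> None:
    """Raise if the total loss has become NaN.

    `total_loss` is assumed to be a plain Python float, e.g. the value
    already transferred to the CPU for writing lcurve.out, so this check
    costs no additional device-to-host copy.
    """
    if math.isnan(total_loss):
        raise FloatingPointError(
            f"Total loss is NaN at step {step}; stopping training "
            "before a checkpoint with NaN parameters is written."
        )


# Hypothetical call sites: right after computing the CPU-side loss for
# lcurve.out, and right before saving a checkpoint.
check_total_loss(0.123, step=100)  # healthy loss: passes silently
try:
    check_total_loss(float("nan"), step=101)
except FloatingPointError as err:
    print(err)
```

Each backend (TensorFlow, PyTorch, PaddlePaddle) would call such a check at the point where it already holds the scalar loss on the CPU.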
Further Information, Files, and Links
No response