Actions: NVIDIA/Megatron-LM
2,500+ workflow runs
RerunStateMachine crash (TypeError: 'NoneType' object is not subscriptable) by not saving a checkpoint after a transient NaN / Inf
Community Bot
#9806: Issue comment #3981 (comment) created by svcnvidia-nemo-ci