Commit 0b5e3ae

Fix RerunStateMachine crash (TypeError: 'NoneType' object is not subscriptable) by not saving a checkpoint after a transient NaN / Inf (#3981)
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
1 parent 97cd326 commit 0b5e3ae

1 file changed: +2 -3 lines changed


megatron/core/rerun_state_machine.py

Lines changed: 2 additions & 3 deletions
@@ -431,10 +431,9 @@ def train_step(data_iterator, ...):
     log_single_rank(
         logger,
         logging.WARNING,
-        "Exiting now. A checkpoint at the last iteration is being saved "
-        "if further examination is needed",
+        "Exiting now. The job can be resumed from a previous checkpoint",
     )
-    return True, True, EXIT_CODE_FAILED_ON_RESULT_VALIDATION
+    return False, True, EXIT_CODE_FAILED_ON_RESULT_VALIDATION
 elif self.state == RerunState.WILL_RERUN_FROM_CHECKPOINT:
     log_single_rank(
         logger,
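
The effect of flipping the first element of the returned tuple can be illustrated with a minimal sketch. It assumes the tuple is read as (should_checkpoint, should_exit, exit_code), which is inferred from the commit title and the changed log message rather than shown in this hunk. The `handle_decision` helper, the `save_checkpoint` callback, and the numeric value of the exit code are hypothetical stand-ins, not Megatron APIs.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in value. The real constant lives in
# megatron/core/rerun_state_machine.py and its value is not shown in this diff.
EXIT_CODE_FAILED_ON_RESULT_VALIDATION = 1


def handle_decision(
    decision: Tuple[bool, bool, int],
    save_checkpoint: Callable[[], None],
) -> Tuple[bool, int]:
    """Act on an assumed (should_checkpoint, should_exit, exit_code) tuple.

    Calls save_checkpoint only when the first element is True, then
    returns (should_exit, exit_code) so the caller can stop the job.
    """
    should_checkpoint, should_exit, exit_code = decision
    if should_checkpoint:
        save_checkpoint()
    return should_exit, exit_code


# Record which variant triggered a checkpoint save.
saved: List[str] = []

# Before the commit: checkpoint requested, then exit.
before = handle_decision(
    (True, True, EXIT_CODE_FAILED_ON_RESULT_VALIDATION),
    lambda: saved.append("before"),
)

# After the commit: no checkpoint, exit with the same code.
after = handle_decision(
    (False, True, EXIT_CODE_FAILED_ON_RESULT_VALIDATION),
    lambda: saved.append("after"),
)
```

Under these assumptions only the pre-commit tuple triggers a save, while both variants still ask the caller to exit with the same code. This matches the new log message, which tells the user to resume from a previous checkpoint.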
