Trying to start from a saved state makes it start from zero again #336
Description
Hi, I'll describe what happened. I enabled saving states before starting to train a Flux LoRA. Halfway through training (in this case epoch 8/16) I had to stop. When I came back later and tried to resume, I got the Accelerate KeyError: 'step', which I solved by following a couple of past issues on this repo that recommend downgrading Accelerate in the SD_Scripts folder to 0.31.
Now my issue is that resuming starts, but in the terminal I found these lines:
```
INFO Could not load random states    checkpointing.py:254
INFO Loading in 0 custom states      accelerator.py:3135
```
The checkpointing.py file shows this block of code:
```python
# Random states
try:
    states = torch.load(input_dir.joinpath(f"{RNG_STATE_NAME}_{process_index}.pkl"))
    random.setstate(states["random_state"])
    np.random.set_state(states["numpy_random_seed"])
    torch.set_rng_state(states["torch_manual_seed"])
    if is_xpu_available():
        torch.xpu.set_rng_state_all(states["torch_xpu_manual_seed"])
    else:
        torch.cuda.set_rng_state_all(states["torch_cuda_manual_seed"])
    if is_torch_xla_available():
        xm.set_rng_state(states["xm_seed"])
    logger.info("All random states loaded successfully")
except Exception:
    logger.info("Could not load random states")
```
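Note that the bare `except Exception` above hides the actual cause: a missing file, a pickle/version mismatch, and a changed key name all produce the same one-line message. A minimal sketch of the same pattern with the exception included in the log, so the real failure is visible (the function and key names here are illustrative, not Accelerate's API):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("resume")

def load_random_states(states: dict) -> bool:
    """Illustrative stand-in for the loader above; returns True on success."""
    try:
        _ = states["random_state"]  # raises KeyError if the checkpoint schema differs
        logger.info("All random states loaded successfully")
        return True
    except Exception as exc:
        # Logging the exception itself (not only a fixed message) reveals
        # whether the failure was a missing file, a key mismatch, etc.
        logger.info("Could not load random states: %r", exc)
        return False
```

Patching the real `checkpointing.py` to log the exception the same way would show exactly why the random states file fails to load.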
And sure enough, the step count starts from 0% again, at 0/1936 instead of 968/1936, and the countdown shows the initial estimate of 4 hours (I had already trained for 2 hours when I stopped midway). Why is it not resuming from the 50% epoch/step mark? Why couldn't it load the random states file?