Trying to start from a saved state makes it start from zero again #336
Description
Hi, I'll describe what happened. I enabled saving states before starting to train a Flux LoRA. Halfway through training (in this case epoch 8/16) I had to stop. When I came back later and tried to resume, I got the Accelerate KeyError: 'step', which I solved by following a couple of past issues on this repo that recommend downgrading Accelerate in the SD_Scripts folder to 0.31.
Now my issue is that resuming starts, but in the terminal I found these lines:
```
INFO Could not load random states    checkpointing.py:254
INFO Loading in 0 custom states      accelerator.py:3135
```
The checkpointing.py file shows this block of code:
```python
# Random states
try:
    states = torch.load(input_dir.joinpath(f"{RNG_STATE_NAME}_{process_index}.pkl"))
    random.setstate(states["random_state"])
    np.random.set_state(states["numpy_random_seed"])
    torch.set_rng_state(states["torch_manual_seed"])
    if is_xpu_available():
        torch.xpu.set_rng_state_all(states["torch_xpu_manual_seed"])
    else:
        torch.cuda.set_rng_state_all(states["torch_cuda_manual_seed"])
    if is_torch_xla_available():
        xm.set_rng_state(states["xm_seed"])
    logger.info("All random states loaded successfully")
except Exception:
    logger.info("Could not load random states")
```
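Note that the bare `except Exception` above hides the actual cause: a missing file, a pickle/version mismatch, and a changed key name all produce the same one-line message. A minimal sketch of the same pattern with the exception included in the log, so the real failure is visible (the function and key names here are illustrative, not Accelerate's API):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("resume")

def load_random_states(states: dict) -> bool:
    """Illustrative stand-in for the loader above; returns True on success."""
    try:
        _ = states["random_state"]  # raises KeyError if the checkpoint schema differs
        logger.info("All random states loaded successfully")
        return True
    except Exception as exc:
        # Logging the exception itself (not only a fixed message) reveals
        # whether the failure was a missing file, a key mismatch, etc.
        logger.info("Could not load random states: %r", exc)
        return False
```

Patching the real `checkpointing.py` to log the exception the same way would show exactly why the random states file fails to load.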
And sure enough, the step count starts from 0% again, at 0/1936 instead of 968/1936, and the countdown shows the initial estimate of 4 hours (I had already trained for 2 hours when I stopped midway). Why is it not resuming from the 50% epoch/step mark? Why couldn't it load the random states file?