
Degraded performance when resuming from checkpoint #3869

@mdenomme24

Description


System Info

- `Accelerate` version: 1.13.0.dev0
- Platform: Linux-6.12.38-gentoo-dist-x86_64-AMD_Ryzen_5_7600X_6-Core_Processor-with-glibc2.41
- `accelerate` bash location: /home/desktop/flash_env/bin/accelerate
- Python version: 3.12.11
- Numpy version: 2.3.1
- PyTorch version: 2.7.1+cu128
- PyTorch accelerator: CUDA
- System RAM: 93.41 GB
- GPU type: NVIDIA GeForce RTX 3060
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: bf16
        - use_cpu: False
        - debug: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - deepspeed_config: {'gradient_accumulation_steps': 2, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': False, 'zero3_save_16bit_model': True, 'zero_stage': 3}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
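
For reference, the same DeepSpeed settings can also be passed programmatically; a minimal sketch, assuming Accelerate's `DeepSpeedPlugin` (the keyword arguments mirror the config keys above):

    # Sketch only: programmatic equivalent of the `deepspeed_config` entries above.
    from accelerate import Accelerator
    from accelerate.utils import DeepSpeedPlugin

    deepspeed_plugin = DeepSpeedPlugin(
        zero_stage=3,
        gradient_accumulation_steps=2,
        gradient_clipping=1.0,
        offload_optimizer_device="cpu",
        offload_param_device="cpu",
        zero3_init_flag=False,
        zero3_save_16bit_model=True,
    )

    accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=deepspeed_plugin)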

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

repro.py

When I run the above script on a single GPU, resuming from a checkpoint produces a loss curve that closely matches the baseline of just running straight through. With `accelerate launch`, running straight through in a single shot also produces a loss curve that resembles the single-GPU baseline, but resuming from a checkpoint under `accelerate launch` shows degraded performance.
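
repro.py is attached above and not reproduced here; the snippet below is only an approximate sketch of how such a save/resume loop can look with Accelerate (assuming `Accelerator.save_state` / `load_state` is used for checkpointing; the helper `build_model_optimizer_dataloader` and the CSV logging are purely illustrative, not the actual script):

    # Approximate sketch of the save/resume flow (illustrative, not the exact repro.py).
    import argparse
    import csv

    import torch
    from accelerate import Accelerator

    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=20)
    parser.add_argument("--max-steps", type=int, default=200)
    parser.add_argument("--resume", action="store_true")
    args = parser.parse_args()

    accelerator = Accelerator()  # DeepSpeed settings come from `accelerate launch` / the config above
    model, optimizer, dataloader = build_model_optimizer_dataloader(args.batch_size)  # hypothetical helper
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    if args.resume:
        accelerator.load_state("checkpoint")  # restore model, optimizer, and RNG state

    losses = []
    for step, (inputs, targets) in enumerate(dataloader):
        if step >= args.max_steps:
            break
        with accelerator.accumulate(model):
            loss = torch.nn.functional.mse_loss(model(inputs), targets)
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
        losses.append(loss.item())

    accelerator.save_state("checkpoint")

    if accelerator.is_main_process:
        with open("losses.csv", "w", newline="") as f:  # illustrative logging for the comparison plot
            csv.writer(f).writerows(enumerate(losses))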

commands:

  1. generate baseline single gpu
    CUDA_VISIBLE_DEVICES="0" python repro.py --batch-size 20 --max-steps 200

  2. generate checkpoint
    CUDA_VISIBLE_DEVICES="0" python repro.py --batch-size 20 --max-steps 1

  3. resume checkpoint
    CUDA_VISIBLE_DEVICES="0" python repro.py --batch-size 20 --max-steps 200 --resume

  4. generate baseline deepspeed zero 3
    accelerate launch repro.py --batch-size 10 --max-steps 200

  5. generate checkpoint
    accelerate launch repro.py --batch-size 10 --max-steps 1

  6. resume checkpoint
    accelerate launch repro.py --batch-size 10 --max-steps 200 --resume
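
To compare the runs, the loss traces can be overlaid with something like the sketch below (assuming each run writes a loss CSV as in the illustrative snippet above; the file names are hypothetical):

    # Overlay the loss traces from the runs above (file names are illustrative).
    import csv

    import matplotlib.pyplot as plt

    for label, path in [
        ("single-gpu baseline", "losses_single_gpu.csv"),
        ("single-gpu resumed", "losses_single_gpu_resume.csv"),
        ("deepspeed baseline", "losses_deepspeed.csv"),
        ("deepspeed resumed", "losses_deepspeed_resume.csv"),
    ]:
        with open(path) as f:
            steps, losses = zip(*((int(s), float(l)) for s, l in csv.reader(f)))
        plt.plot(steps, losses, label=label)

    plt.xlabel("step")
    plt.ylabel("loss")
    plt.legend()
    plt.savefig("loss_comparison.png")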

If it's not just my machine, you should see traces like the ones below:

(screenshot: loss curves for the runs above, with the resumed DeepSpeed run diverging from the other traces)

Expected behavior

All of the traces should overlap within reason.
