
Degraded performance when resuming from checkpoint #3869

@mdenomme24

Description


System Info

- `Accelerate` version: 1.13.0.dev0
- Platform: Linux-6.12.38-gentoo-dist-x86_64-AMD_Ryzen_5_7600X_6-Core_Processor-with-glibc2.41
- `accelerate` bash location: /home/desktop/flash_env/bin/accelerate
- Python version: 3.12.11
- Numpy version: 2.3.1
- PyTorch version: 2.7.1+cu128
- PyTorch accelerator: CUDA
- System RAM: 93.41 GB
- GPU type: NVIDIA GeForce RTX 3060
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: bf16
        - use_cpu: False
        - debug: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - deepspeed_config: {'gradient_accumulation_steps': 2, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': False, 'zero3_save_16bit_model': True, 'zero_stage': 3}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
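
For reference, the same DeepSpeed settings can also be passed programmatically; a minimal sketch, assuming Accelerate's `DeepSpeedPlugin` (the keyword arguments mirror the config keys above):

    # Sketch only: programmatic equivalent of the `deepspeed_config` entries above.
    from accelerate import Accelerator
    from accelerate.utils import DeepSpeedPlugin

    deepspeed_plugin = DeepSpeedPlugin(
        zero_stage=3,
        gradient_accumulation_steps=2,
        gradient_clipping=1.0,
        offload_optimizer_device="cpu",
        offload_param_device="cpu",
        zero3_init_flag=False,
        zero3_save_16bit_model=True,
    )

    accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=deepspeed_plugin)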

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

repro.py

When I run the above script on a single GPU, resuming from a checkpoint produces a loss curve that closely matches the baseline of just running straight through. With `accelerate launch`, running straight through in a single shot also produces a loss curve that resembles the single-GPU baseline, but resuming from a checkpoint under `accelerate launch` shows degraded performance.
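
repro.py is attached above and not reproduced here; the snippet below is only an approximate sketch of how such a save/resume loop can look with Accelerate (assuming `Accelerator.save_state` / `load_state` is used for checkpointing; the helper `build_model_optimizer_dataloader` and the CSV logging are purely illustrative, not the actual script):

    # Approximate sketch of the save/resume flow (illustrative, not the exact repro.py).
    import argparse
    import csv

    import torch
    from accelerate import Accelerator

    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=20)
    parser.add_argument("--max-steps", type=int, default=200)
    parser.add_argument("--resume", action="store_true")
    args = parser.parse_args()

    accelerator = Accelerator()  # DeepSpeed settings come from `accelerate launch` / the config above
    model, optimizer, dataloader = build_model_optimizer_dataloader(args.batch_size)  # hypothetical helper
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    if args.resume:
        accelerator.load_state("checkpoint")  # restore model, optimizer, and RNG state

    losses = []
    for step, (inputs, targets) in enumerate(dataloader):
        if step >= args.max_steps:
            break
        with accelerator.accumulate(model):
            loss = torch.nn.functional.mse_loss(model(inputs), targets)
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
        losses.append(loss.item())

    accelerator.save_state("checkpoint")

    if accelerator.is_main_process:
        with open("losses.csv", "w", newline="") as f:  # illustrative logging for the comparison plot
            csv.writer(f).writerows(enumerate(losses))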

commands:

  1. generate baseline single gpu
    CUDA_VISIBLE_DEVICES="0" python repro.py --batch-size 20 --max-steps 200

  2. generate checkpoint
    CUDA_VISIBLE_DEVICES="0" python repro.py --batch-size 20 --max-steps 1

  3. resume checkpoint
    CUDA_VISIBLE_DEVICES="0" python repro.py --batch-size 20 --max-steps 200 --resume

  4. generate baseline deepspeed zero 3
    accelerate launch repro.py --batch-size 10 --max-steps 200

  5. generate checkpoint
    accelerate launch repro.py --batch-size 10 --max-steps 1

  6. resume checkpoint
    accelerate launch repro.py --batch-size 10 --max-steps 200 --resume
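
To compare the runs, the loss traces can be overlaid with something like the sketch below (assuming each run writes a loss CSV as in the illustrative snippet above; the file names are hypothetical):

    # Overlay the loss traces from the runs above (file names are illustrative).
    import csv

    import matplotlib.pyplot as plt

    for label, path in [
        ("single-gpu baseline", "losses_single_gpu.csv"),
        ("single-gpu resumed", "losses_single_gpu_resume.csv"),
        ("deepspeed baseline", "losses_deepspeed.csv"),
        ("deepspeed resumed", "losses_deepspeed_resume.csv"),
    ]:
        with open(path) as f:
            steps, losses = zip(*((int(s), float(l)) for s, l in csv.reader(f)))
        plt.plot(steps, losses, label=label)

    plt.xlabel("step")
    plt.ylabel("loss")
    plt.legend()
    plt.savefig("loss_comparison.png")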

If it's not just my machine, you should see traces like the ones below:

(screenshot: loss curves for the runs above, with the resumed DeepSpeed run diverging from the other traces)

Expected behavior

All of the traces should overlap within reason.
