Description
System Info
- `Accelerate` version: 1.13.0.dev0
- Platform: Linux-6.12.38-gentoo-dist-x86_64-AMD_Ryzen_5_7600X_6-Core_Processor-with-glibc2.41
- `accelerate` bash location: /home/desktop/flash_env/bin/accelerate
- Python version: 3.12.11
- Numpy version: 2.3.1
- PyTorch version: 2.7.1+cu128
- PyTorch accelerator: CUDA
- System RAM: 93.41 GB
- GPU type: NVIDIA GeForce RTX 3060
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- deepspeed_config: {'gradient_accumulation_steps': 2, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': False, 'zero3_save_16bit_model': True, 'zero_stage': 3}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
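For reference, a sketch of how the same DeepSpeed settings could be passed programmatically through `DeepSpeedPlugin` instead of the saved config file (this is only the equivalent of the config above, not how my script is actually configured):

```python
# Reference sketch only: the actual runs use the saved `accelerate` config
# picked up by `accelerate launch`, not this programmatic setup.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=3,
    gradient_accumulation_steps=2,
    gradient_clipping=1.0,
    offload_optimizer_device="cpu",
    offload_param_device="cpu",
    zero3_init_flag=False,
    zero3_save_16bit_model=True,
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=deepspeed_plugin)
```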
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the `examples/` folder of Accelerate, or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
When I run the repro script (`repro.py`) on a single GPU, resuming from a checkpoint produces a loss curve that closely matches the baseline of running straight through. With `accelerate launch` (DeepSpeed ZeRO-3), running straight through in a single shot also produces a loss curve that resembles the single-GPU baseline. However, resuming from a checkpoint under `accelerate launch` shows degraded performance.
commands:
- generate baseline (single GPU):
  CUDA_VISIBLE_DEVICES="0" python repro.py --batch-size 20 --max-steps 200
- generate checkpoint:
  CUDA_VISIBLE_DEVICES="0" python repro.py --batch-size 20 --max-steps 1
- resume checkpoint:
  CUDA_VISIBLE_DEVICES="0" python repro.py --batch-size 20 --max-steps 200 --resume
- generate baseline (DeepSpeed ZeRO-3):
  accelerate launch repro.py --batch-size 10 --max-steps 200
- generate checkpoint:
  accelerate launch repro.py --batch-size 10 --max-steps 1
- resume checkpoint:
  accelerate launch repro.py --batch-size 10 --max-steps 200 --resume
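For context, a minimal sketch of the kind of script these commands assume: a plain Accelerate training loop that checkpoints with `accelerator.save_state()` and resumes with `accelerator.load_state()`. The model, data, optimizer, and checkpoint path below are placeholders, not the exact `repro.py` used above.

```python
# Hypothetical minimal repro script (placeholder for repro.py): trains a small
# model, logs the loss every step, saves a checkpoint at the end, and can
# resume from that checkpoint with --resume so the loss curves can be compared.
import argparse

import torch
from accelerate import Accelerator
from accelerate.utils import set_seed


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=20)
    parser.add_argument("--max-steps", type=int, default=200)
    parser.add_argument("--resume", action="store_true")
    args = parser.parse_args()

    set_seed(42)
    accelerator = Accelerator()

    model = torch.nn.Sequential(
        torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 1)
    )
    # Plain AdamW for simplicity; with ZeRO CPU offload, some DeepSpeed
    # versions prefer DeepSpeedCPUAdam, which this sketch does not cover.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    # Synthetic regression data so the sketch is self-contained.
    dataset = torch.utils.data.TensorDataset(
        torch.randn(4096, 128), torch.randn(4096, 1)
    )
    dataloader = torch.utils.data.DataLoader(
        dataset, batch_size=args.batch_size, shuffle=True
    )

    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    if args.resume:
        # Restores model/optimizer (and the DeepSpeed engine state when
        # running under ZeRO) from the checkpoint directory written below.
        accelerator.load_state("checkpoint")

    step = 0
    while step < args.max_steps:
        for x, y in dataloader:
            with accelerator.accumulate(model):
                loss = torch.nn.functional.mse_loss(model(x), y)
                accelerator.backward(loss)
                optimizer.step()
                optimizer.zero_grad()
            if accelerator.is_main_process:
                print(f"step {step} loss {loss.item():.6f}")
            step += 1
            if step >= args.max_steps:
                break

    accelerator.save_state("checkpoint")


if __name__ == "__main__":
    main()
```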
If it's not just my machine, you should see traces like the ones below.
Expected behavior
All the traces should overlap within reason.