Advanced training has an issue when saving checkpoints #7229

@landmann

Describe the bug

I'm running both

https://github.com/huggingface/diffusers/blob/main/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py

and

https://github.com/huggingface/diffusers/blob/main/examples/advanced_diffusion_training/train_dreambooth_lora_sdxl_advanced.py

Both fail when saving a checkpoint. The failures are different, but both happen at checkpoint time.

Reproduction

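Train with either script with checkpointing enabled. I used `--checkpointing_steps=10` for the SD 1.5 run; the full command is in the logs below.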

Logs

For SDXL:


Traceback (most recent call last):
  File "/home/ubuntu/notebooks/../scripts/train_lora_sdxl.py", line 2234, in <module>
    main(args)
  File "/home/ubuntu/notebooks/../scripts/train_lora_sdxl.py", line 2007, in main
    accelerator.save_state(save_path)
  File "/home/ubuntu/myenv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2767, in save_state
    hook(self._models, weights, output_dir)
  File "/home/ubuntu/notebooks/../scripts/train_lora_sdxl.py", line 1420, in save_model_hook
    embedding_handler.save_embeddings(f"{output_dir}/{args.output_dir}_emb.safetensors")
  File "/home/ubuntu/notebooks/../scripts/train_lora_sdxl.py", line 787, in save_embeddings
    save_file(tensors, file_path)
  File "/home/ubuntu/myenv/lib/python3.10/site-packages/safetensors/torch.py", line 281, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
safetensors_rust.SafetensorError: Error while serializing: IoError(Os { code: 2, kind: NotFound, message: "No such file or directory" })
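One possible reading of the SDXL traceback, offered as a guess rather than a confirmed diagnosis: the embeddings filename is built as `f"{output_dir}/{args.output_dir}_emb.safetensors"`, so if `--output_dir` is an absolute or nested path, the result points into a subdirectory of the checkpoint folder that was never created, which would produce the "No such file or directory" error. The sketch below only reproduces the string construction. The `--output_dir` value is borrowed from the SD 1.5 command (the SDXL command is not shown), the `checkpoint-10` folder name is an assumption, and `embeddings_path_basename` is a hypothetical workaround, not the fix adopted by the library.

```python
import posixpath  # the report is from Linux, so POSIX path rules apply


def embeddings_path_as_in_traceback(checkpoint_dir: str, output_dir_arg: str) -> str:
    # Mirrors the f-string visible in save_model_hook in the traceback.
    return f"{checkpoint_dir}/{output_dir_arg}_emb.safetensors"


def embeddings_path_basename(checkpoint_dir: str, output_dir_arg: str) -> str:
    # Hypothetical workaround: use only the last component of --output_dir
    # so the file lands directly inside the checkpoint directory.
    name = posixpath.basename(posixpath.normpath(output_dir_arg))
    return posixpath.join(checkpoint_dir, f"{name}_emb.safetensors")


# Assumed values: --output_dir taken from the SD 1.5 command in this issue,
# checkpoint folder name assumed for illustration.
output_dir_arg = "/home/ubuntu/nate_models/lora_fml"
checkpoint_dir = posixpath.join(output_dir_arg, "checkpoint-10")

broken = embeddings_path_as_in_traceback(checkpoint_dir, output_dir_arg)
fixed = embeddings_path_basename(checkpoint_dir, output_dir_arg)

print(broken)  # parent directory is nested below the checkpoint folder
print(fixed)   # parent directory is the checkpoint folder itself
```

If this is indeed the cause, passing a plain relative name as `--output_dir` (no slashes) would likely sidestep the error without touching the script.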

For SD 1.5 (this is only the `accelerate` launcher's traceback; the script's own error is not included):

Traceback (most recent call last):
  File "/home/ubuntu/myenv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/myenv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/ubuntu/myenv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
    simple_launcher(args)
  File "/home/ubuntu/myenv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/myenv/bin/python3', '../scripts/train_lora_sd_15.py', '--pretrained_model_name_or_path=/home/ubuntu/models/sd15/fml', '--dataset_name=/home/ubuntu/nate_pics_768/', '--output_dir=/home/ubuntu/nate_models/lora_fml', "--instance_prompt='a TOK man'", '--gradient_accumulation_steps=1', '--caption_column=prompt', '--train_batch_size=4', '--repeats=1', '--mixed_precision=bf16', '--resolution=768', '--gradient_checkpointing', '--learning_rate=1.0', '--text_encoder_lr=1.0', '--adam_beta2=0.99', '--optimizer=prodigy', '--train_text_encoder_ti', '--train_text_encoder_ti_frac=0.5', '--token_abstraction=TOK', '--snr_gamma=5.0', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--rank=32', '--max_train_steps=2160', '--checkpointing_steps=10', '--seed=0', '--with_prior_preservation', '--prior_generation_precision=bf16', '--sample_batch_size=1', "--class_prompt='a man'", '--class_data_dir=/home/ubuntu/notebooks/man_4321_imgs_768x768px', '--report_to=wandb', '--validation_prompt=a TOK man, professional headshot, hyperdetailed photography, soft light, head and shoulders portrait, cover', '--num_validation_images=3', '--validation_epochs=200']' returned non-zero exit status 1.


System Info

Linux, nothing special. Python 3.10, per the tracebacks above.

Who can help?

@sayakpaul

Labels: bug (Something isn't working)
