Closed
Labels
bug (Something isn't working), stale (Issues that haven't received updates)
Description
Describe the bug
When training the AutoencoderKL model with train_autoencoderkl.py, the loss does not converge on the ImageNet dataset, unlike this.
Reproduction
Script
accelerate launch --multi_gpu --num_processes=2 --gpu_ids=0,1 \
     train_autoencoderkl.py \
    --pretrained_model_name_or_path stabilityai/sd-vae-ft-mse \
    --max_train_steps 850000 \
    --validation_steps 100 \
    --checkpointing_steps 1000 \
    --gradient_accumulation_steps 2 \
    --learning_rate 4.5e-6 \
    --lr_scheduler cosine \
    --report_to wandb \
    --mixed_precision bf16 \
    --train_batch_size 8 \
    --dataloader_num_workers 16 \
    --output_dir autoencoderkl-model/imagenet \
    --train_data_dir /datasets/image/imagenet-test/train \
    --validation_image ./val/ILSVRC2012_val_00000293.JPEG ./val/ILSVRC2012_val_00002138.JPEG \
    --resolution 128
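
As a point of reference, a minimal sketch (assuming the stock diffusers AutoencoderKL API and the same [-1, 1] normalization the training script applies) of how the pretrained sd-vae-ft-mse checkpoint's reconstruction MSE and KL could be measured on one of the validation images above at 128px, to establish the baseline the fine-tuning loss would be expected to stay near:

import torch
import torch.nn.functional as F
import torchvision.transforms as T
from PIL import Image
from diffusers import AutoencoderKL

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the same pretrained checkpoint used in the training command above.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to(device).eval()

# Preprocess at the same 128px resolution; [-1, 1] normalization is an
# assumption about what train_autoencoderkl.py applies.
preprocess = T.Compose([
    T.Resize(128),
    T.CenterCrop(128),
    T.ToTensor(),
    T.Normalize([0.5], [0.5]),
])

image = Image.open("./val/ILSVRC2012_val_00000293.JPEG").convert("RGB")
x = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    posterior = vae.encode(x).latent_dist   # DiagonalGaussianDistribution
    z = posterior.sample()
    recon = vae.decode(z).sample

rec_mse = F.mse_loss(recon, x)
kl = posterior.kl().mean()
print(f"baseline reconstruction MSE: {rec_mse.item():.4f}, KL: {kl.item():.2f}")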
Logs
System Info
- 🤗 Diffusers version: 0.33.0.dev0
- Platform: Linux-5.15.0-67-generic-x86_64-with-glibc2.17
- Running on Google Colab?: No
- Python version: 3.8.20
- PyTorch version (GPU?): 2.4.1+cu121 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.30.1
- Transformers version: 4.46.3
- Accelerate version: 1.0.1
- PEFT version: not installed
- Bitsandbytes version: 0.45.4
- Safetensors version: 0.5.3
- xFormers version: 0.0.28.post1
- Accelerator: NVIDIA GeForce RTX 3090, 24576 MiB (x2)
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?