Fix: try aligning dtype of matrices when training with DeepSpeed and mixed precision is set to bf16 or fp16 (#2060)
Conversation
kohya-ss left a comment:
I don't personally use DeepSpeed but this PR looks good to me. I would appreciate it if you could check the comments.
My questions are about the requirements. Where was deepspeed coming from before? Is updating to 2.6.0, and having diffusers automatically update it, a good idea given the various ways you need to install torch for backend compatibility? Is the diffusers[torch] extra holding this back? It seems it would be aligned the same way, with the torch version being the same. Some environments still only support torch 2.4, so moving to the latest (2.6.0) as a requirement might cause some issues.
I have tried installing deepspeed using Then I add
* get device type from model
* add logger warning
* format
* format
* format
kohya-ss left a comment:
Sorry for the delay. I'd like to confirm requirements.txt.
Thank you for the update!
Hi, but when I switch to `mixed_precision=bf16`, it still raises the [mat1 and mat2 must have the same dtype, but got Float and BFloat16] error. I am running the script `flux_train_control_net.py` with the command:

```shell
accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train_control_net.py \
  --pretrained_model_name_or_path flux1-dev.safetensors --clip_l clip_l.safetensors \
  --t5xxl t5xxl_fp16.safetensors --ae ae.safetensors --save_model_as safetensors \
  --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 1 --seed 42 \
  --gradient_checkpointing --mixed_precision bf16 --optimizer_type adamw8bit \
  --learning_rate 2e-5 --highvram --max_train_epochs 1 --save_every_n_steps 1000 \
  --output_dir /path/to/output --output_name flux-cn --timestep_sampling shift \
  --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1.0 \
  --deepspeed --dataset_config dataset.toml --log_tracker_name "sd-scripts-flux-cn"
```

In addition, the machine only supports CUDA 12.2, so I downloaded PyTorch 2.4.0 from the cu121 channel. Could that be causing the problem?
What problem does this PR solve?
This PR mainly tries to fix the problem described in issue #1871.
When I tried to do some training with the script `flux_train.py`, I met the same error as in the issue above. When I removed `--deepspeed`, I had to run with a low `train_batch_size`, which makes training slow.

Solution
I tried adding a wrapper in `deepspeed_utils.py` to wrap the models' forward functions with `torch.autocast`, which provides convenience methods for mixed precision.

Changes detailed

* Added `__warp_with_torch_autocast` to class `DeepSpeedWrapper` in `deepspeed_utils.py`.
* Pinned `deepspeed==0.16.7` in requirements.
* Added a check for `accelerator.distributed_type == DistributedType.DEEPSPEED` in function `patch_accelerator_for_fp16_training` of script `train_util.py`, because DeepSpeed internally handles loss scaling for mixed-precision training, so `accelerator.scaler` would be `None`, which results in the same error as issue #476.

After these changes, the dtype error disappeared, `train_batch_size` increased from 2 (without DeepSpeed) to 12 (with DeepSpeed and mixed precision) running on 8x Nvidia A100 GPUs (80 GB memory each), and training got a 17.54% speed-up with the following command:
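A minimal sketch of the forward-wrapping pattern this PR describes. The names (`wrap_forward_with_autocast`, `DummyModel`) are hypothetical, and a dummy context manager stands in for `torch.autocast` so the sketch runs without PyTorch; the actual wrapper in `deepspeed_utils.py` may differ.

```python
import functools
from contextlib import contextmanager

# Hypothetical stand-in for torch.autocast so this sketch runs without
# PyTorch installed; the real wrapper would enter
# torch.autocast(device_type=..., dtype=...) here instead.
@contextmanager
def fake_autocast(device_type, dtype):
    yield

def wrap_forward_with_autocast(model, device_type="cuda", dtype="bf16"):
    """Replace model.forward so every call runs inside an autocast context."""
    original_forward = model.forward

    @functools.wraps(original_forward)
    def forward_with_autocast(*args, **kwargs):
        # All ops inside this context would be autocast to the target dtype,
        # avoiding "mat1 and mat2 must have the same dtype" mismatches.
        with fake_autocast(device_type, dtype):
            return original_forward(*args, **kwargs)

    model.forward = forward_with_autocast
    return model

class DummyModel:
    def forward(self, x):
        return x * 2

model = wrap_forward_with_autocast(DummyModel())
print(model.forward(3))  # -> 6; behavior is unchanged, only the context is added
```

Because the wrapper preserves the original forward's signature and return value, it can be applied once after model construction without touching call sites.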