Whisper large multi GPU OutOfMemoryError issue #2198

sadhiin · 2024-06-01T20:14:56Z

sadhiin
Jun 1, 2024

Hi 👋🏼 mate...
Context: Last 3–4 days, I was trying to fine-tune the Whisper-learge model for Arabic transcription. Successfully, I made it run on a single GPU with a SageMaker NB instance ml.g5.8xlarge. I have large in-house dataset and I have train the model on it. Thus, I need to run it on multi-GPU setting for faster training.

Previously i followed @sanchit-gandhi post on github as well as huggingface lastly if found two readme

https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition#multi-gpu-whisper-training
https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event#python-script
setting up with for large model with multi gpu. Here is my training script (a toy run to check) with 1k traing and 100 testing smaple.

deepspeed --num_gpus=4 s2s_train.py \
    --deepspeed="ds_config.json" \
    --model_name_or_path="openai/whisper-large-v2" \
    --dataset_name="mozilla-foundation/common_voice_17_0" \
    --dataset_config_name="ar" \
    --language="arabic" \
    --train_split_name="/home/ec2-user/SageMaker/cv-17/ar/train_data.csv" \
    --max_train_samples=1000 \
    --eval_split_name="/home/ec2-user/SageMaker/cv-17/ar/test_data.csv" \
    --max_eval_samples=100 \
    --num_train_epochs="5" \
    --max_steps="-1" \
    --output_dir="whisper_l2_AR" \
    --per_device_train_batch_size="32" \
    --gradient_accumulation_steps="2" \
    --per_device_eval_batch_size="16" \
    --logging_steps="25" \
    --learning_rate="2.5e-5" \
    --warmup_steps="3" \
    --eval_strategy="steps" \
    --eval_steps="100" \
    --save_strategy="steps" \
    --save_steps="100" \
    --generation_max_length="225" \
    --length_column_name="input_length" \
    --max_duration_in_seconds="30" \
    --text_column_name="sentence" \
    --freeze_feature_encoder="False" \
    --report_to="tensorboard" \
    --metric_for_best_model="wer" \
    --greater_is_better="False" \
    --load_best_model_at_end \
    --gradient_checkpointing \
    --fp16 \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --predict_with_generate

ds_config.json taken from here

Traceback error log

[INFO|trainer.py:2108] 2024-06-01 17:42:49,183 >> ***** Running training *****
[INFO|trainer.py:2109] 2024-06-01 17:42:49,183 >> Num examples = 1,000
[INFO|trainer.py:2110] 2024-06-01 17:42:49,183 >> Num Epochs = 5
[INFO|trainer.py:2111] 2024-06-01 17:42:49,183 >> Instantaneous batch size per device = 32
[INFO|trainer.py:2114] 2024-06-01 17:42:49,183 >> Total train batch size (w. parallel, distributed & accumulation) = 256
[INFO|trainer.py:2115] 2024-06-01 17:42:49,183 >> Gradient Accumulation steps = 2
[INFO|trainer.py:2116] 2024-06-01 17:42:49,183 >> Total optimization steps = 20
[INFO|trainer.py:2117] 2024-06-01 17:42:49,185 >> Number of trainable parameters = 1,541,384,960
0%| | 0/20 [00:00<?, ?it/s]/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
[WARNING|logging.py:328] 2024-06-01 17:42:58,459 >> use_cache = True is incompatible with gradient checkpointing. Setting use_cache = False...
[WARNING|logging.py:328] 2024-06-01 17:42:58,894 >> use_cache = True is incompatible with gradient checkpointing. Setting use_cache = False...
[WARNING|logging.py:328] 2024-06-01 17:42:59,040 >> use_cache = True is incompatible with gradient checkpointing. Setting use_cache = False...
[WARNING|logging.py:328] 2024-06-01 17:42:59,155 >> use_cache = True is incompatible with gradient checkpointing. Setting use_cache = False...
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/s2s_train.py", line 643, in
main()
File "/home/ec2-user/SageMaker/s2s_train.py", line 593, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 1912, in train
return inner_training_loop(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 2248, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/transformers/trainer.py", line 3293, in training_step
self.accelerator.backward(loss, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/accelerator.py", line 2117, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 166, in backward
self.engine.backward(loss, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2056, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/autograd/init.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
return user_fn(self, *args)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 288, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/autograd/init.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.68 GiB. GPU 3 has a total capacty of 22.19 GiB of which 985.50 MiB is free. Including non-PyTorch memory, this process has 21.21 GiB memory in use. Of the allocated memory 16.54 GiB is allocated by PyTorch, and 4.22 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-06-01 17:43:24,196] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 21447

nvidia-smi info.

Error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.68 GiB. GPU 3 has a total capacty of 22.19 GiB of which 985.50 MiB is free. Including non-PyTorch memory, this process has 21.21 GiB memory in use. Of the allocated memory 16.54 GiB is allocated by PyTorch, and 4.22 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Note: Sometime it's showing GPU 0 something 1 or 2, base on the per_device_train_batch_size value.

I'll appreciate you suggestion and idea's to solve this OOM error.

Good Day 👋🏼

Answered by sanchit-gandhi

Jun 5, 2024

Hey @sadhiin - you can likely use large-v3 instead of large-v2, since it's a stronger pre-trained model:

-    --model_name_or_path="openai/whisper-large-v2" \
+    --model_name_or_path="openai/whisper-large-v3" \

Assuming you've set-up deepspeed correctly (which it looks like you have based on your config), I would recommend reducing your per-device batch-size:

    --per_device_train_batch_size="16" \
    --gradient_accumulation_steps="1" \

Note that your overall batch size will be:

per_device_train_batch_size * gradient_accumulation_steps * num_gpus

Plugging in the numbers for your specific config:

16 * 1 * 4 = 64

Which should be sufficient for fine-tuning. If you still hit OOM, reduce…

View full answer

phineas-pta · 2024-06-02T22:02:41Z

phineas-pta
Jun 2, 2024

smaller batch size

1 reply

sadhiin Jun 6, 2024
Author

Hi @phineas-pta Thanks for your suggestion.

sanchit-gandhi · 2024-06-05T09:52:19Z

sanchit-gandhi
Jun 5, 2024

Hey @sadhiin - you can likely use large-v3 instead of large-v2, since it's a stronger pre-trained model:

-    --model_name_or_path="openai/whisper-large-v2" \
+    --model_name_or_path="openai/whisper-large-v3" \

Assuming you've set-up deepspeed correctly (which it looks like you have based on your config), I would recommend reducing your per-device batch-size:

    --per_device_train_batch_size="16" \
    --gradient_accumulation_steps="1" \

Note that your overall batch size will be:

per_device_train_batch_size * gradient_accumulation_steps * num_gpus

Plugging in the numbers for your specific config:

16 * 1 * 4 = 64

Which should be sufficient for fine-tuning. If you still hit OOM, reduce your per-device batch size by another factor of two, and increase your gradient accumulation steps by a factor of two.

2 replies

sadhiin Jun 6, 2024
Author

Thank you, @sanchit-gandhi . That's really helped.

sadhiin Jun 9, 2024
Author

Hello, @sanchit-gandhi
Can you please help me for multi-gpu settings on notebook, such as this tutorial on HF by you?

Navanit-git · 2025-07-08T05:21:21Z

Navanit-git
Jul 8, 2025

@sadhiin have you finetuned it then ?

4 replies

sadhiin Jul 9, 2025
Author

Hi @Navanit-git 👋🏼 Luckily I made that.

Navanit-nebula Jul 9, 2025

kindly share it on hugging face. I am searching for best arabic tts model for normal conversations and there is currently no model which can beat large v3

sadhiin Jul 9, 2025
Author

I'm unable to share that, as the work is patented by my previous company 😞

Navanit-nebula Jul 9, 2025

sad to hear that. Its okay thanks anyways

Whisper large multi GPU OutOfMemoryError issue #2198

Uh oh!

sadhiin Jun 1, 2024

Replies: 3 comments · 7 replies

Uh oh!

phineas-pta Jun 2, 2024

Uh oh!

sadhiin Jun 6, 2024 Author

Uh oh!

sanchit-gandhi Jun 5, 2024

Uh oh!

sadhiin Jun 6, 2024 Author

Uh oh!

sadhiin Jun 9, 2024 Author

Uh oh!

Navanit-git Jul 8, 2025

Uh oh!

sadhiin Jul 9, 2025 Author

Uh oh!

Navanit-nebula Jul 9, 2025

Uh oh!

sadhiin Jul 9, 2025 Author

Uh oh!

Navanit-nebula Jul 9, 2025

sadhiin
Jun 1, 2024

Replies: 3 comments 7 replies

phineas-pta
Jun 2, 2024

sadhiin Jun 6, 2024
Author

sanchit-gandhi
Jun 5, 2024

sadhiin Jun 6, 2024
Author

sadhiin Jun 9, 2024
Author

Navanit-git
Jul 8, 2025

sadhiin Jul 9, 2025
Author

sadhiin Jul 9, 2025
Author