CUDA error: device-side assert triggered while fine tuning on my dataset #2313
-
Hello everyone, I'm currently using the Whisper-large-v3 model on an Nvidia A6000 graphics card with approximately 47 GB of VRAM. I successfully fine-tuned this model on the Persian ('fa') subset of the Mozilla Common Voice 17.0 dataset without any issues, following this guide from Hugging Face and the corresponding Colab notebook. Now I have my own dataset, which contains around 250 hours of audio. When I attempted to fine-tune the Whisper model on it using the same approach as with the Common Voice data, I encountered the following error multiple times:
The final error message I received was:
I tried to compile with the following settings, but I still encountered the same error:
Interestingly, when I adjusted the amount of data used for training, the step at which the error occurred also changed. For example, using the entire dataset caused the error at the 12th step, while limiting the dataset size pushed the error to the 80th step. I monitored the remaining GPU memory during training and found that I still had about 7 GB free. I suspect the issue might be related to the labels: when I replaced my dataset's original labels with random labels from the Common Voice dataset, the error disappeared. I used Whisper's tokenizer to create the labels. I would greatly appreciate any insights or suggestions! Thank you!
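For reference, one way I could sanity-check the labels is to tokenize the transcripts the same way the labels were built and look at how long they get. This is a minimal sketch, not the exact code from my setup; the "sentence" column name and the label_length_stats helper are just illustrative:

```python
from transformers import WhisperTokenizer

# Rough sanity check: tokenize each transcript and report how long the
# resulting label sequences are. The "sentence" column is an assumption.
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-large-v3", language="fa", task="transcribe"
)

def label_length_stats(dataset, text_column="sentence"):
    lengths = [len(tokenizer(ex[text_column]).input_ids) for ex in dataset]
    return {"max": max(lengths), "mean": sum(lengths) / len(lengths)}

# Comparing these stats between my dataset and the Common Voice split
# would show whether my transcripts produce unusually long labels.
```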
-
FYI, in case this related discussion and the debugging process there are helpful.
-
Hello everyone,
I've resolved the issue! The problem was that the length of the labels exceeded Whisper's max_target_positions configuration. For instance, the default max_target_positions for whisper-large-v3 is 448 tokens. You can either trim your labels or adjust the configuration. Additionally, I submitted a pull request that aims to prevent such issues in future versions of transformers. For more information on similar issues, feel free to check out this issue and this one.
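If it helps anyone hitting the same device-side assert: another option, besides trimming, is to drop the examples whose tokenized labels don't fit before training. This is a minimal sketch, assuming your processed dataset already has a "labels" column as in the Hugging Face fine-tuning guide; the labels_fit name is illustrative:

```python
from transformers import WhisperConfig

# Keep only examples whose label sequence fits within the decoder's
# positional embeddings (max_target_positions, 448 for whisper-large-v3).
max_label_length = WhisperConfig.from_pretrained(
    "openai/whisper-large-v3"
).max_target_positions

def labels_fit(labels):
    # Strictly shorter than the limit, to leave a small margin.
    return len(labels) < max_label_length

# dataset = dataset.filter(labels_fit, input_columns=["labels"])
```

Filtering keeps the audio-transcript alignment intact, whereas truncating the labels would silently cut off part of the transcripts.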