CUDA error: device-side assert triggered while fine tuning on my dataset #2313
-
Hello everyone, I'm currently using the Whisper-large-v3 model on an Nvidia A6000 graphics card with approximately 47 GB of VRAM. I successfully fine-tuned this model on the Persian ('fa') subset of the Mozilla Common Voice 17.0 dataset without any issues, following this guide from Hugging Face and the corresponding Colab notebook. Now I have my own dataset, which contains around 250 hours of audio. When I attempted to fine-tune the Whisper model on it using the same approach as with the Common Voice data, I encountered the following error multiple times:
The final error message I received was:
I tried to compile with the following settings, but I still encountered the same error:
Interestingly, when I adjusted the amount of data used for training, the step at which the error occurred also changed. For example, using the entire dataset caused the error at the 12th step, while limiting the dataset size pushed the error to the 80th step. I monitored the remaining GPU memory during training and found that I still had about 7 GB free. I suspect the issue might be related to the labels: when I replaced my dataset's original labels with random labels from the Common Voice dataset, the error disappeared. I used Whisper's tokenizer to create the labels. I would greatly appreciate any insights or suggestions! Thank you!
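For reference, one way I could sanity-check the labels is to tokenize the transcripts the same way the labels were built and look at how long they get. This is a minimal sketch, not the exact code from my setup; the "sentence" column name and the label_length_stats helper are just illustrative:

```python
from transformers import WhisperTokenizer

# Rough sanity check: tokenize each transcript and report how long the
# resulting label sequences are. The "sentence" column is an assumption.
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-large-v3", language="fa", task="transcribe"
)

def label_length_stats(dataset, text_column="sentence"):
    lengths = [len(tokenizer(ex[text_column]).input_ids) for ex in dataset]
    return {"max": max(lengths), "mean": sum(lengths) / len(lengths)}

# Comparing these stats between my dataset and the Common Voice split
# would show whether my transcripts produce unusually long labels.
```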
-
FYI, in case this related discussion and the debugging process there are helpful.
-
Hello everyone,
I've resolved the issue! The problem was that the length of the labels exceeded Whisper's max_target_positions configuration. For instance, the default max_target_positions for whisper-large-v3 is 448 tokens. You can either trim your labels or adjust the configuration. Additionally, I submitted a pull request that aims to prevent such issues in future versions of transformers. For more information on similar issues, feel free to check out this issue and this one.
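If it helps anyone hitting the same device-side assert: another option, besides trimming, is to drop the examples whose tokenized labels don't fit before training. This is a minimal sketch, assuming your processed dataset already has a "labels" column as in the Hugging Face fine-tuning guide; the labels_fit name is illustrative:

```python
from transformers import WhisperConfig

# Keep only examples whose label sequence fits within the decoder's
# positional embeddings (max_target_positions, 448 for whisper-large-v3).
max_label_length = WhisperConfig.from_pretrained(
    "openai/whisper-large-v3"
).max_target_positions

def labels_fit(labels):
    # Strictly shorter than the limit, to leave a small margin.
    return len(labels) < max_label_length

# dataset = dataset.filter(labels_fit, input_columns=["labels"])
```

Filtering keeps the audio-transcript alignment intact, whereas truncating the labels would silently cut off part of the transcripts.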