Hi,
After running Whisper I get the transcription in several formats, e.g. SRT or VTT.
I have manually corrected the transcription and would now like to fine-tune the Whisper model with this improved SRT or VTT file.
Do you know of any script that could convert the SRT file and the corresponding audio file into the structure needed for fine-tuning?
Based on my research, this is the accepted dataset format:
https://huggingface.co/docs/datasets/audio_load
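For reference, the docs linked above describe an "audiofolder" layout where the audio clips sit next to a metadata.csv file that maps each clip to its transcript. Here is a minimal, stdlib-only sketch of writing that file; the column name "transcription" and the clip file names are assumptions for illustration, not something prescribed by Whisper itself.

```python
import csv

# Hypothetical (file_name, transcript) pairs -- in practice these would come
# from splitting the audio at the SRT segment boundaries.
segments = [
    ("clip_000.wav", "Hello world."),
    ("clip_001.wav", "Second line."),
]

# Write the metadata.csv that the Hugging Face "audiofolder" loader expects:
# a header row with a mandatory file_name column, then one row per clip.
with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["file_name", "transcription"])
    writer.writerows(segments)
```

With the clips and metadata.csv in one folder, the dataset should then load via `load_dataset("audiofolder", data_dir=...)` as described in the linked docs.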
I tried https://github.com/jumon/whisper-finetuning, but it does not really work for me: it creates the training files with the associated timestamps, but it has some bugs of its own, e.g. when splitting the SRT and audio files I end up with the same text several times. Unfortunately the maintainer does not have time to take care of it, and in any case training even a tiny model was taking me 16 hours on 2x T4 GPUs. I'm afraid the jumon repository is not an option; I spent 1-2 days trying to fix it but had to admit that my Python skills are still rather basic...
So I was wondering whether there is a solution for converting an SRT file into a dataset that can then be used to fine-tune the model.
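The first step of any such conversion would be parsing the corrected SRT into timed segments. A stdlib-only sketch of that step (the `parse_srt` helper is a hypothetical name, not part of any library; it assumes the standard HH:MM:SS,mmm SRT timestamp format):

```python
import re

# Matches an SRT timing line: "HH:MM:SS,mmm --> HH:MM:SS,mmm"
TS = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def parse_srt(srt_text):
    """Parse SRT contents into (start_sec, end_sec, text) tuples."""
    segments = []
    # Subtitle blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            m = TS.match(line.strip())
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is subtitle text.
                text = " ".join(l.strip() for l in lines[i + 1:]).strip()
                segments.append((start, end, text))
                break
    return segments
```

Each (start, end) pair could then be used to cut the corresponding slice out of the audio file (e.g. with ffmpeg) and paired with its text to build the dataset.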
Thank you very much!