Hi,
After running Whisper I get the transcription in several formats, e.g. SRT or VTT.
I have manually corrected the transcription and would now like to fine-tune the Whisper model with this improved SRT or VTT file.
Do you know of any script that could convert the SRT file and the corresponding audio file into the structure needed for fine-tuning?
Based on my research, this is the accepted dataset format:
https://huggingface.co/docs/datasets/audio_load
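For reference, the docs linked above describe an "audiofolder" layout where the audio clips sit next to a metadata.csv file that maps each clip to its transcript. Here is a minimal, stdlib-only sketch of writing that file; the column name "transcription" and the clip file names are assumptions for illustration, not something prescribed by Whisper itself.

```python
import csv

# Hypothetical (file_name, transcript) pairs -- in practice these would come
# from splitting the audio at the SRT segment boundaries.
segments = [
    ("clip_000.wav", "Hello world."),
    ("clip_001.wav", "Second line."),
]

# Write the metadata.csv that the Hugging Face "audiofolder" loader expects:
# a header row with a mandatory file_name column, then one row per clip.
with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["file_name", "transcription"])
    writer.writerows(segments)
```

With the clips and metadata.csv in one folder, the dataset should then load via `load_dataset("audiofolder", data_dir=...)` as described in the linked docs.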
I tried https://github.com/jumon/whisper-finetuning, but it does not really work for me: it creates the training files with the associated timestamps, but it has some bugs of its own, e.g. when splitting the SRT and audio files I end up with the same text several times. Unfortunately the maintainer does not have time to take care of it, and in any case training even a tiny model was taking me 16 hours on 2x T4 GPUs. I'm afraid the jumon repository is not an option; I spent 1-2 days trying to fix it but had to admit that my Python skills are still rather basic...
So I was wondering whether there is a solution for converting an SRT file into a dataset that can then be used to fine-tune the model.
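The first step of any such conversion would be parsing the corrected SRT into timed segments. A stdlib-only sketch of that step (the `parse_srt` helper is a hypothetical name, not part of any library; it assumes the standard HH:MM:SS,mmm SRT timestamp format):

```python
import re

# Matches an SRT timing line: "HH:MM:SS,mmm --> HH:MM:SS,mmm"
TS = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def parse_srt(srt_text):
    """Parse SRT contents into (start_sec, end_sec, text) tuples."""
    segments = []
    # Subtitle blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            m = TS.match(line.strip())
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is subtitle text.
                text = " ".join(l.strip() for l in lines[i + 1:]).strip()
                segments.append((start, end, text))
                break
    return segments
```

Each (start, end) pair could then be used to cut the corresponding slice out of the audio file (e.g. with ffmpeg) and paired with its text to build the dataset.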
Thank you very much!