Train Whisper on New Language #2190
Replies: 5 comments 14 replies
-
Whisper's ability to support a new language is mediocre when the tokenizer does not cover your language well. You may want to try wav2vec2 instead: https://huggingface.co/blog/fine-tune-w2v2-bert
-
That applies if the language has very little training data. I have now built a dataset that is around 400-420 hours long. Will it be sufficient to train a Whisper model? Also, since I have the data in two scripts, will I be able to train two models with two different language codes?
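Splitting the corpus into the two script-specific training sets can be sketched with a simple Unicode-range check; `detect_script` is a helper name invented here, not an existing API.

```python
# Sketch: partition transcripts into a Devanagari set and a Roman set
# before building the two training corpora. Any character in the
# Devanagari Unicode block (U+0900-U+097F) marks the line as Devanagari.

def detect_script(text: str) -> str:
    if any("\u0900" <= ch <= "\u097f" for ch in text):
        return "devanagari"
    return "roman"

samples = ["नमस्कार", "namaskar"]
print([detect_script(s) for s in samples])  # ['devanagari', 'roman']
```

Since the audio is shared, only the transcript column differs between the two corpora; the same recording can appear in both.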
-
If I send my data to OpenAI, could they train my model and keep it closed until my PhD is done?
-
Right now, 50 hours of the data have transcriptions while the rest do not. How can I do unsupervised training with Whisper?
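Whisper fine-tuning itself is supervised, so a common workaround (self-training / pseudo-labeling, not an official Whisper feature) is: fine-tune on the 50 labeled hours, transcribe the unlabeled audio with that model, keep only confident hypotheses, and fine-tune again on the enlarged set. A minimal sketch of the filtering step, assuming you have already produced `(audio_path, transcript, avg_logprob)` tuples from your model (e.g. via `model.generate` with scores):

```python
# Sketch: keep only pseudo-labels whose average log-probability clears a
# threshold. The threshold value and the input tuples are placeholders.

def filter_pseudo_labels(hypotheses, threshold=-0.5):
    """Keep (audio_path, transcript) pairs with avg_logprob >= threshold."""
    return [(path, text) for path, text, avg_logprob in hypotheses
            if avg_logprob >= threshold]

# Example with made-up scores:
hyps = [("a.wav", "text a", -0.2), ("b.wav", "text b", -1.3)]
print(filter_pseudo_labels(hyps))  # [('a.wav', 'text a')]
```

The threshold is a knob to tune: too loose and you train on errors, too strict and you discard most of the 350+ unlabeled hours.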
-
I want to train Whisper on Konkani speech. The transcriptions are available in both Devanagari and Roman script, and I want to make two separate models, one per script. The audio recordings are the same for each sentence/recording.
I want to train the model with Hugging Face (preferably), but other methods are also possible.
Can someone outline the general script for the task?
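A hedged outline of the usual Hugging Face fine-tuning recipe (feature extraction, tokenized labels, `Seq2SeqTrainer`) follows. The dataset path, column name, and hyperparameters are placeholders, and since Konkani is not in Whisper's language list, reusing a related language token such as Marathi (`"mr"`) is a common community workaround, not an official recipe.

```python
# Sketch of fine-tuning Whisper on one script's corpus; run it twice,
# once per transcript set, to get the two models.
import torch

def pad_labels(label_lists, pad_id=-100):
    """Pad tokenized transcripts with -100 so the loss ignores padding."""
    max_len = max(len(labels) for labels in label_lists)
    return torch.tensor(
        [labels + [pad_id] * (max_len - len(labels)) for labels in label_lists]
    )

def main():
    from datasets import Audio, load_dataset
    from transformers import (
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
        WhisperForConditionalGeneration,
        WhisperProcessor,
    )

    # Konkani has no Whisper language token; "marathi" is a stand-in.
    processor = WhisperProcessor.from_pretrained(
        "openai/whisper-small", language="marathi", task="transcribe"
    )
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

    ds = load_dataset("audiofolder", data_dir="my_konkani_dataset")  # placeholder
    ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

    def prepare(batch):
        audio = batch["audio"]
        batch["input_features"] = processor.feature_extractor(
            audio["array"], sampling_rate=audio["sampling_rate"]
        ).input_features[0]
        # "transcription" is a placeholder column name from metadata.csv
        batch["labels"] = processor.tokenizer(batch["transcription"]).input_ids
        return batch

    ds = ds.map(prepare)

    def collate(features):
        input_features = torch.tensor([f["input_features"] for f in features])
        labels = pad_labels([f["labels"] for f in features])
        return {"input_features": input_features, "labels": labels}

    args = Seq2SeqTrainingArguments(
        output_dir="whisper-konkani-devanagari",  # placeholder
        per_device_train_batch_size=8,
        learning_rate=1e-5,
        max_steps=4000,
        fp16=torch.cuda.is_available(),
    )
    trainer = Seq2SeqTrainer(
        model=model, args=args, train_dataset=ds["train"], data_collator=collate
    )
    trainer.train()

# Calling main() launches training; it needs the dataset on disk and
# network access to download the checkpoint.
```

For the Roman-script model, repeat with the Roman transcriptions and a different `output_dir`; the audio preprocessing is identical since the recordings are shared.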