-
Hi, great work, congrats! Would it be possible to add a new language to the model by fine-tuning it on my own dataset, or in some other way? Thanks.
-
We haven't tried fine-tuning, but it could be a good avenue of research for evaluating Whisper models as pretrained representations for unseen languages. We have observed some transfer between linguistically adjacent languages, such as Asturian <-> Spanish (Castilian) or Cebuano <-> Filipino (Tagalog). So if your language of interest has an adjacent language that works acceptably in Whisper, you could fine-tune on your dataset using that language's token.
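For anyone trying this, here is a minimal sketch of what reusing an adjacent language's token looks like during data preparation with Hugging Face Transformers. This isn't something we ship; the model size, the dataset field names, and the choice of Spanish as the stand-in language are all assumptions you would adapt to your own data.

```python
# Minimal sketch: train on a new language while reusing an adjacent
# language's token (here Spanish, <|es|>, as a stand-in for e.g. Asturian).
# Assumes the Hugging Face `transformers` fine-tuning setup; the dataset
# fields "audio" and "sentence" are placeholders for your own columns.
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="spanish", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def prepare_example(batch):
    # batch["audio"] is assumed to be a dict with a 16 kHz "array" field,
    # batch["sentence"] the reference transcript in the new language.
    audio = batch["audio"]
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # The tokenizer prepends the <|es|> language token, so the model is
    # trained to treat the new language as if it were Spanish.
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch
```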
-
Check out this blog on fine-tuning Whisper for multilingual ASR with Hugging Face Transformers: https://huggingface.co/blog/fine-tune-whisper It provides a step-by-step guide to fine-tuning, right from data preparation to evaluation 🤗 There's a Google Colab so you can also run it as a notebook 😉
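For reference, the training step in that guide boils down to roughly the following. This is a condensed sketch, not the full notebook: it assumes `model`, `processor`, `dataset`, `data_collator`, and `compute_metrics` have been set up in the blog's earlier steps, and the hyperparameters are only illustrative.

```python
# Condensed sketch of the Seq2SeqTrainer setup from the fine-tuning blog.
# `model`, `processor`, `dataset`, `data_collator`, and `compute_metrics`
# are assumed to exist from the earlier data-preparation steps.
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-finetuned",  # placeholder path
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    fp16=True,
    evaluation_strategy="steps",
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
trainer.train()
```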
-
Thank you! 🙂
-
Can we fine-tune Whisper for the language identification task using Hugging Face, @sanchit-gandhi?
-
Hi, may I know whether we can fine-tune Whisper on a new language for the language identification task, so that Whisper can detect the new language?
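One thing to keep in mind: Whisper's language identification is a softmax over the fixed set of language tokens in its tokenizer, so a genuinely new language would have to either reuse an existing token or extend the tokenizer's vocabulary. As a baseline before any fine-tuning, here is a minimal sketch of inspecting the model's language predictions with the original openai-whisper package; "audio.wav" and the model size are placeholders.

```python
# Minimal sketch: inspect Whisper's language identification with the
# original openai-whisper package; "audio.wav" is a placeholder path.
import whisper

model = whisper.load_model("small")
audio = whisper.pad_or_trim(whisper.load_audio("audio.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect_language scores the 30-second window against every language token
# the model knows and returns a dict of {language_code: probability}.
_, probs = model.detect_language(mel)
print(max(probs, key=probs.get))
```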
-
I’ve completed a project where I fine-tuned Whisper-Tiny for translation tasks, and it worked well. You can check out my repo to see the process I followed; it could help you solve the problem you're encountering.
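For anyone reading along, getting a checkpoint to translate (rather than transcribe) with Hugging Face Transformers mostly comes down to forcing the `<|translate|>` task token at generation time. Below is a minimal sketch; the stock `openai/whisper-tiny` checkpoint and a silent dummy clip stand in for your own fine-tuned model and real audio.

```python
# Minimal sketch: run Whisper on the translate task with Hugging Face
# Transformers. The checkpoint and the silent dummy clip are placeholders
# for a fine-tuned model and real audio.
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# One second of silence at 16 kHz as a stand-in for real audio.
dummy_audio = np.zeros(16000, dtype=np.float32)
input_features = processor(
    dummy_audio, sampling_rate=16000, return_tensors="pt"
).input_features

# Force the <|translate|> task token so the decoder emits English output
# regardless of which source-language token it predicts.
forced_ids = processor.get_decoder_prompt_ids(task="translate")
generated_ids = model.generate(input_features, forced_decoder_ids=forced_ids)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```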