Finetuning Whisper for translation tasks #1646

WassayS · 2023-09-09T04:17:40Z

I have a domain-specific audio dataset in Hindi, and I want to improve Whisper's translation capabilities. Currently, the model can take non-English audio input and translate it into English text.

I understand how to fine-tune Whisper for transcription, where the output text is in the same language as the audio. I'm not sure how to fine-tune it specifically for cross-lingual translation, where the audio is in another language and the goal is to improve translation into English. Could you provide guidance on how to fine-tune the model for this purpose, or share a repo?

AmgadHasan · 2024-01-23T00:09:46Z

Hi.
Did you have any success with this?

0 replies

emanueleielo · 2024-02-04T12:09:04Z

I read somewhere that to fine-tune for this task you can follow this guide ("fine tuning whisper") and just change the dataset and set the task to translate instead of transcribe.
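A minimal sketch of what "set the task to translate" could look like, assuming the guide follows the Hugging Face transformers preprocessing pattern (a processor that bundles a feature extractor and a tokenizer). The checkpoint name `openai/whisper-small`, the language value `"hindi"`, and the column names `audio` and `sentence` are assumptions for illustration, not details from this thread.

```python
def load_translate_processor(checkpoint="openai/whisper-small", language="hindi"):
    """Load a Whisper processor configured for the translate task.

    The import is lazy so the helper below can be used (and tested) without
    downloading a checkpoint. Checkpoint and language are assumptions.
    """
    from transformers import WhisperProcessor

    return WhisperProcessor.from_pretrained(
        checkpoint, language=language, task="translate"
    )


def prepare_example(batch, processor):
    """Turn one dataset record into model inputs and labels.

    Assumes batch["audio"] holds a decoded waveform plus its sampling rate,
    and batch["sentence"] holds the English translation of the audio
    (not a same-language transcript).
    """
    audio = batch["audio"]
    features = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    )
    return {
        # Log-mel input features computed from the waveform.
        "input_features": features.input_features[0],
        # Token ids of the target text the decoder should learn to produce.
        "labels": processor.tokenizer(batch["sentence"]).input_ids,
    }
```

With a `datasets`-style dataset, a function like this would typically be applied to every record with `dataset.map(...)`; the rest of the training loop would stay as in the transcription setup.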

I need to do the same, but I haven't tried it yet. My concern is WER: how can it be calculated for a translation task? A predicted text can often have the same meaning as the reference text but use different words, in which case WER will be misleading.

Did you manage it?

0 replies

AmgadHasan · 2024-02-15T15:26:14Z

@emanueleielo
Yes, I managed to successfully fine-tune whisper for translation.

For evaluation, WER isn't a good metric for translation. You want to use one of the translation metrics like BLEU Score, METEOR, COMET or similar.
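To illustrate the concern, here is a small pure-Python sketch of WER (word-level edit distance divided by reference length). The example sentences are made up for illustration. It only shows how WER penalizes a valid paraphrase; for BLEU, METEOR, or COMET one would normally rely on an existing evaluation library rather than hand-rolled code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: same meaning, different wording.
reference = "the doctor will see you tomorrow morning"
paraphrase = "the physician can meet you early tomorrow"

wer_exact = word_error_rate(reference, reference)        # identical text: 0.0
wer_paraphrase = word_error_rate(reference, paraphrase)  # 5 edits over 7 words
```

The paraphrase is an acceptable translation, yet its WER is above 0.7, which is why a metric designed for translation is the better choice here.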

Hope that helps!

0 replies

rishikksh20 · 2024-05-12T07:46:52Z

Hi @AmgadHasan
Did you get better results from fine-tuning for non-English-to-English translation compared with the stock, non-fine-tuned Whisper model?

2 replies

AmgadHasan May 13, 2024

Hi @rishikksh20
Yes. I have seen significant improvement in the model after fine-tuning. The translation accuracy is much better and there is less hallucination.

dgoryeo May 13, 2024

@AmgadHasan, this sounds very promising. In your experience, how many additional hours of audio data did you need to achieve the improvement?

EmreOzkose · 2024-12-11T13:22:52Z

Hi @AmgadHasan, can you share how you prepared your custom data?

For example, let's say the language pair is en->hi.

print(custom_en_hi_dataloader["train"][0])

would be

{'audio': {'path': 'path_to_en.wav', 
           'array': ...,
           'sampling_rate': 16000},
 'sentence': 'खीर की मिठास पर गरमाई बिहार की सियासत, कुशवाहा ने दी सफाई'}

right?
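One way records of that shape could be assembled, assuming a hypothetical CSV manifest with `path` and `translation` columns (both names are made up for this sketch). It only builds the metadata and leaves decoding the waveform into `array` to whichever audio loader is in use. Whether an en->hi target works well is not established in this thread.

```python
import csv
import io


def read_manifest(csv_text, sampling_rate=16000):
    """Build dataset-style records from a CSV manifest.

    Each record pairs an audio file with the target-language text.
    Rows missing either field are skipped.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        path = (row.get("path") or "").strip()
        sentence = (row.get("translation") or "").strip()
        if not path or not sentence:
            continue
        records.append(
            {
                # sampling_rate is the rate the loader is expected to resample to.
                "audio": {"path": path, "sampling_rate": sampling_rate},
                "sentence": sentence,
            }
        )
    return records


# Hypothetical manifest; the second row has no translation and is dropped.
manifest = 'path,translation\npath_to_en.wav,"नमस्ते, दुनिया"\nmissing.wav,\n'
records = read_manifest(manifest)
```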

1 reply

FOLSc Jan 14, 2025

Hi @AmgadHasan,

I noticed your example of preparing custom data for the en->hi language pair. I'm just wondering if this approach you mentioned actually works well in practice for preparing the data? Also, have you successfully implemented English to other languages translation using this kind of data preparation method? I'm really curious about the results and any insights you might have on this. Thanks!

pr0mila · 2025-04-17T11:59:53Z

I worked on a project where I fine-tuned Whisper-Tiny for translation tasks, and it worked well. You can check out my repo to see how I did it, and it might help you fix the issue you're having. I have mentioned all the steps in README.md.

MediBeng-Whisper-Tiny on GitHub

8 replies

pr0mila Apr 18, 2025

Thanks for the insight. I have not done any model training, but have been very keen to do it some day. I also saw your ParquetToHuggingFace repo. Quite helpful approach, in my view.

Thanks a lot! I made that repo because I was having trouble loading my dataset in Parquet format. Glad you found it useful! If you ever want to try out model training or need help with the repo, just give me a shout.

dgoryeo Apr 18, 2025

Thank you so much. Will do!

younghounSon Jun 27, 2025

Hello @pr0mila! I have a question. For example, when translating Chinese speech to English text, should I set the processor to language=english, task=translate, or to language=chinese, task=translate?
Thank you

pr0mila Jun 27, 2025

Hi @younghounSon! In my case, I used Bengali-English code-mixed speech, and the output was English. So I set the language as english and the task as translate. That worked well for me.
You can check the repository I mentioned earlier for reference:
https://github.com/pr0mila/MediBeng-Whisper-Tiny/blob/main/config/config.py

Hope it helps!
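For context on what the two settings change, here is a small sketch of the special-token prefix that Whisper's multitask format places at the start of the decoder sequence. `decoder_prefix` and the small language table are hypothetical helpers written for this illustration, not library functions. In Whisper's published format the language token is described as identifying the language spoken in the audio, while the translate task token requests English output, so the two options in the question differ only in that one token. Which one works better for a given dataset (for example code-mixed speech, as above) is something to verify on held-out data.

```python
# Small illustrative subset of language names and their token codes.
LANGUAGE_CODES = {"chinese": "zh", "english": "en", "hindi": "hi", "bengali": "bn"}


def decoder_prefix(language, task, timestamps=False):
    """Return the special-token prefix for a language/task combination."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    code = LANGUAGE_CODES[language.lower()]
    tokens = ["<|startoftranscript|>", f"<|{code}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return "".join(tokens)


prefix_zh = decoder_prefix("chinese", "translate")
prefix_en = decoder_prefix("english", "translate")
```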

younghounSon Jun 27, 2025

Thank you so much for your help! It was very helpful.

diaselma · 2025-06-26T10:56:18Z

Hello,

I’m currently working on a translation project involving Wolof, an underrepresented African language.

I have a dataset consisting of audio files in Wolof and their corresponding translations in French.
However, I do not have transcriptions of the audio in the original (Wolof) language.

My question is:
Is it possible to fine-tune the Whisper model using only these (audio, translated text) pairs for the translation task, without providing transcriptions in the source language?

I would like to confirm whether such a fine-tuning approach is technically valid with Whisper (e.g., using task="translate"), and whether others have had success with similar setups.

Thank you very much for your support and insights 🙏

1 reply

pr0mila Jun 26, 2025

Hi,

Yes, it’s possible to fine-tune Whisper using only (audio, translated text) pairs for translation tasks without source language transcriptions. I did something similar in my MediBeng Whisper project, where I fine-tuned Whisper for translation.

Both the dataset and model are open-source, so you can check them to get an idea. I used task="translate" during fine-tuning, and it worked well even without transcriptions in the original language.

Project link: https://github.com/pr0mila/MediBeng-Whisper-Tiny
Dataset link: https://huggingface.co/datasets/pr0mila-gh0sh/MediBeng

Let me know if you’d like more details. Happy to help!

Best,
Promila
