Fine-tuning Whisper in more than one language #1432
Replies: 3 comments · 18 replies
-
I don't think you can fine-tune on several languages in one click, but you can do it sequentially, one language after another.
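To make "sequentially, one language after another" concrete, here is a rough sketch with Hugging Face Transformers: each round starts from the checkpoint saved by the previous round. The dataset IDs, language pairs, training arguments, and the prepare_features / data_collator helpers are illustrative assumptions (preprocessing and the padding collator follow the usual Whisper fine-tuning recipe), not something taken from this thread.

```python
# Rough sketch: fine-tune one language at a time, each round starting from
# the checkpoint produced by the previous round.
from datasets import Audio, load_dataset
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

BASE = "openai/whisper-small"  # assumed starting checkpoint
checkpoint = BASE

for lang_code, lang_name in [("es", "spanish"), ("gl", "galician")]:
    # Rebuild the processor each round so the language/task prefix tokens
    # match the language currently being trained on.
    processor = WhisperProcessor.from_pretrained(
        BASE, language=lang_name, task="transcribe"
    )
    model = WhisperForConditionalGeneration.from_pretrained(checkpoint)

    ds = load_dataset("mozilla-foundation/common_voice_11_0", lang_code, split="train")
    ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
    # prepare_features is a hypothetical helper that turns audio into
    # log-Mel input features and transcripts into label ids.
    ds = ds.map(lambda batch: prepare_features(batch, processor))

    args = Seq2SeqTrainingArguments(
        output_dir=f"whisper-small-{lang_code}",
        per_device_train_batch_size=8,
        max_steps=1_000,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=ds,
        data_collator=data_collator,  # padding speech-seq2seq collator, defined elsewhere
    )
    trainer.train()
    trainer.save_model(args.output_dir)

    checkpoint = args.output_dir  # the next language continues from here
```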
-
As long as your language is one of the languages Whisper supports, it will be encoded and decoded correctly, so yes, it is language independent. Regarding self.processor.tokenizer.batch_decode, it is used when computing the metrics for the ASR task, so it is correct to skip special tokens (you only want to compute the metric on the transcription itself).
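For reference, a minimal sketch of the kind of metric function this refers to: special tokens (including the language and task prefix tokens) are skipped when decoding, so only the transcription text enters the metric. The processor name is assumed to be a WhisperProcessor defined elsewhere, and WER is just one possible ASR metric.

```python
# Minimal sketch of a compute_metrics function for the ASR task; `processor`
# is assumed to be a WhisperProcessor defined elsewhere.
import evaluate

wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # Labels padded with -100 are restored to the pad token before decoding.
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # skip_special_tokens=True drops the language/task/timestamp tokens,
    # so only the transcription text is compared.
    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    return {"wer": 100 * wer_metric.compute(predictions=pred_str, references=label_str)}
```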
-
I’ve created a code-switched language dataset for fine-tuning Whisper, including audio data along with CSV and Parquet files, which I’ve stored on Hugging Face. After preparing the dataset, I fine-tuned the model for translation. You can explore the entire end-to-end project in my repo. Here’s the link to check it out: https://github.com/pr0mila/MediBeng-Whisper-Tiny
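For anyone wanting to reproduce a similar setup, here is a hedged sketch of loading such a dataset from the Hub and preparing it for the translation task. The dataset ID and the "text"/"audio" column names are placeholders, not the actual ones from the repository linked above.

```python
# Hedged sketch: load a code-switched dataset from the Hub and prepare it for
# the speech-translation task. The dataset ID and column names are placeholders.
from datasets import Audio, load_dataset
from transformers import WhisperProcessor

dataset = load_dataset("your-username/your-code-switched-dataset", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# For translation (speech -> English text) the task token is "translate".
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny", task="translate")

def prepare(batch):
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

dataset = dataset.map(prepare, remove_columns=dataset.column_names)
```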
-
Suppose I have a dataset in two or more languages (one of them under-represented in Whisper's pre-trained models), and I want to fine-tune on those two or more languages so that the model stays multilingual and avoids catastrophic forgetting. Is that kind of fine-tuning possible?
Can I define the tokenizer and the processor without indicating the language?
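For example (just to illustrate what I mean; the column names are assumptions about the dataset), something along these lines, where the processor is created without a fixed language and the language prefix token is set per example:

```python
# Sketch of the per-example language idea: the processor is created without a
# fixed language, and the language prefix token is set for each example.
# The "language" and "sentence" column names are assumptions about the dataset.
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small", task="transcribe")

def prepare(batch):
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]

    # Switch the language token to match this example (a Whisper language
    # name or code, e.g. "spanish" or "es"); the task stays fixed.
    processor.tokenizer.set_prefix_tokens(language=batch["language"], task="transcribe")
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch
```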