Did you train the model with all available Common Voice datasets, even for languages that are not available in Whisper? #349
Replies: 4 comments · 4 replies
-
We used the Common Voice dataset only for evaluation, not for training. It's interesting that it can translate Esperanto to English well; I wonder if the model is generalizing from its similarity to other Indo-European languages or learned from actual Esperanto in the training data (despite it being labeled with the wrong language). Given the similar-enough phonology and orthography, I suspect you can fine-tune the model to transcribe into proper Esperanto while reusing the …
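For anyone reproducing this, the translate path looks roughly like the sketch below, using the openai-whisper package; the audio file name is a placeholder:

```python
import whisper

# Load a multilingual checkpoint; "small" is an arbitrary choice here.
model = whisper.load_model("small")

# Esperanto has no language token, so we don't force a language and
# only set the task. The translate task emits English text.
result = model.transcribe("esperanto_sample.mp3", task="translate")
print(result["text"])
```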
-
Both are possible: it could be generalization, but there is also a lot of wrongly labeled Esperanto content on YouTube, Wikimedia, LibriVox, and so on. Plus there is a lot of mixed-language content, for example language courses. Fine-tuning the model sounds like an interesting idea; I will have a look at this, thanks! I would choose a French or Polish model, though (French has the most vocabulary overlap with Esperanto, and Polish has the most pronunciation overlap).

Is there any chance that the next version of Whisper will include more languages? I believe that mixed-language models could be a game changer for STT for minority languages, not only Esperanto but also other small language communities.
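As a rough outline, here is what that fine-tuning could look like with the Hugging Face transformers library. The dataset ID, checkpoint, and hyperparameters are assumptions, and the sketch reuses the French language token (per the suggestion above), since Esperanto has no token of its own:

```python
import torch
from datasets import load_dataset, Audio
from transformers import (
    WhisperProcessor,
    WhisperForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

# Reuse the French language token, since Esperanto is not in Whisper's
# tokenizer; the model then learns to emit Esperanto under that token.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="French", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Common Voice 9 Esperanto; the dataset ID is an assumption.
cv = load_dataset("mozilla-foundation/common_voice_9_0", "eo", split="train")
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

cv = cv.map(prepare, remove_columns=cv.column_names)

class Collator:
    """Pads audio features and label token ids into a batch."""

    def __call__(self, features):
        batch = processor.feature_extractor.pad(
            [{"input_features": f["input_features"]} for f in features],
            return_tensors="pt",
        )
        labels = processor.tokenizer.pad(
            [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
        )
        # Replace padding with -100 so it is ignored by the loss.
        batch["labels"] = labels["input_ids"].masked_fill(
            labels["attention_mask"].ne(1), -100
        )
        return batch

args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-eo",
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    max_steps=4000,
    fp16=torch.cuda.is_available(),
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=cv,
    data_collator=Collator(),
)
trainer.train()
```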
-
I just discovered this discussion post while playing with Whisper and figuring out what it's capable of. I'm one of the members of the Toki Pona community, and we'd love to be able to add support for Toki Pona transcription to Whisper, even if it never becomes a core part of Whisper. We added …
-
I would also like to add my request for official Esperanto support in Whisper. There are now close to 2,000 hours of Esperanto audio and text in Common Voice, and because of the extraordinary regularity of the language this should be enough to train Whisper, so that the next update can achieve a very low average WER in the language. Plus, there are other sources of Esperanto audio and text available under open licenses (e.g. LibriVox) for training.
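For a sense of scale, the Esperanto split can be pulled from the Hugging Face hub; the dataset release in the ID below is an assumption:

```python
from datasets import load_dataset

# The Common Voice release in the dataset ID is an assumption;
# use whichever version is current.
cv_eo = load_dataset("mozilla-foundation/common_voice_13_0", "eo", split="train")
print(cv_eo.num_rows, "clips in the train split")
print(cv_eo[0]["sentence"])  # transcript paired with the audio clip
```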
-
Hi,
I am part of a group of enthusiasts who built up the Esperanto dataset on Common Voice, which has more than 1,000 validated hours in dataset version 9, the version you used for your training.
Esperanto is quite interesting for machine learning because it has some unique properties.
The Esperanto Vosk STT model has a WER of only 7.24 with this relatively small dataset, which is quite astonishing. Esperanto also works great on GPT-3 and, to some degree, on GPT-2. The language is part of many other public datasets as well.
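For context, WER is the word-level edit distance divided by the reference length, usually quoted as a percentage; a tiny sketch with the jiwer package, using made-up sentences:

```python
from jiwer import wer

# Word error rate = (substitutions + insertions + deletions) / reference words.
reference = "la suno brilas super la montoj"
hypothesis = "la suno brilas super la montej"  # one substituted word out of six
print(f"WER: {wer(reference, hypothesis):.2%}")  # 16.67%
```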
I tested Whisper with some Esperanto files. The English translation already works remarkably well, but the transcription doesn't, because the files are always classified as Latin.
Right now, Whisper supports only the languages defined in tokenizer.py, right?
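Both points can be checked quickly; a small sketch with the openai-whisper package, where the audio file name is a placeholder:

```python
import whisper
from whisper.tokenizer import LANGUAGES

# Esperanto has no entry in the supported-language table.
print("eo" in LANGUAGES)  # False

model = whisper.load_model("small")
audio = whisper.pad_or_trim(whisper.load_audio("esperanto_sample.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detection must pick one of the supported languages, so Esperanto
# input tends to land on a neighbor such as "la" (Latin).
_, probs = model.detect_language(mel)
print(max(probs, key=probs.get))
```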
I have two questions about this:
Best wishes,
Stefan