Did you train the model with all available Common Voice datasets, even for languages that are not available in Whisper? #349
Replies: 4 comments · 4 replies
-
We used the Common Voice dataset only for evaluation, not for training. It's interesting that it can translate Esperanto to English well; I wonder if the model is generalizing from its similarity to other Indo-European languages or learned from actual Esperanto in the training data (despite it being labeled with the wrong language). Given the similar-enough phonology and orthography, I suspect you can fine-tune the model to transcribe into proper Esperanto while reusing the …
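For anyone reproducing this, the translate path looks roughly like the sketch below, using the openai-whisper package; the audio file name is a placeholder:

```python
import whisper

# Load a multilingual checkpoint; "small" is an arbitrary choice here.
model = whisper.load_model("small")

# Esperanto has no language token, so we don't force a language and
# only set the task. The translate task emits English text.
result = model.transcribe("esperanto_sample.mp3", task="translate")
print(result["text"])
```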
-
Both are possible: it could be generalization, but there is also a lot of wrongly labeled Esperanto content on YouTube, Wikimedia, LibriVox, and so on. Plus there is a lot of mixed-language content, for example language courses. Fine-tuning the model sounds like an interesting idea; I will have a look at this, thanks! I would choose a French or Polish model, though (French has the most vocabulary overlap with Esperanto, and Polish has the most pronunciation overlap).

Is there any chance that the next version of Whisper will include more languages? I believe that mixed-language models could be a game changer for STT for minority languages, not only Esperanto but also other small language communities.
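As a rough outline, here is what that fine-tuning could look like with the Hugging Face transformers library. The dataset ID, checkpoint, and hyperparameters are assumptions, and the sketch reuses the French language token (per the suggestion above), since Esperanto has no token of its own:

```python
import torch
from datasets import load_dataset, Audio
from transformers import (
    WhisperProcessor,
    WhisperForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

# Reuse the French language token, since Esperanto is not in Whisper's
# tokenizer; the model then learns to emit Esperanto under that token.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="French", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Common Voice 9 Esperanto; the dataset ID is an assumption.
cv = load_dataset("mozilla-foundation/common_voice_9_0", "eo", split="train")
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

cv = cv.map(prepare, remove_columns=cv.column_names)

class Collator:
    """Pads audio features and label token ids into a batch."""

    def __call__(self, features):
        batch = processor.feature_extractor.pad(
            [{"input_features": f["input_features"]} for f in features],
            return_tensors="pt",
        )
        labels = processor.tokenizer.pad(
            [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
        )
        # Replace padding with -100 so it is ignored by the loss.
        batch["labels"] = labels["input_ids"].masked_fill(
            labels["attention_mask"].ne(1), -100
        )
        return batch

args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-eo",
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    max_steps=4000,
    fp16=torch.cuda.is_available(),
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=cv,
    data_collator=Collator(),
)
trainer.train()
```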
-
I just discovered this discussion post while playing with Whisper and figuring out what it's capable of. I'm one of the members of the Toki Pona community, and we'd love to be able to add support for Toki Pona transcription to Whisper, even if it never becomes a core part of Whisper. We added …
-
I would also like to add my request for official Esperanto support in Whisper. There are now close to 2,000 hours of Esperanto audio and text in Common Voice, and because of the extraordinary regularity of the language this should be enough to train Whisper, so that the next update can achieve a very low average WER in the language. Plus, there are other sources of Esperanto audio and text available under open licenses (e.g. LibriVox) for training.
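For a sense of scale, the Esperanto split can be pulled from the Hugging Face hub; the dataset release in the ID below is an assumption:

```python
from datasets import load_dataset

# The Common Voice release in the dataset ID is an assumption;
# use whichever version is current.
cv_eo = load_dataset("mozilla-foundation/common_voice_13_0", "eo", split="train")
print(cv_eo.num_rows, "clips in the train split")
print(cv_eo[0]["sentence"])  # transcript paired with the audio clip
```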
-
Hi,
I am part of a group of enthusiasts who built up the Esperanto dataset on Common Voice, which has more than 1,000 validated hours in dataset version 9, the version you used for your training.
Esperanto is quite interesting for machine learning because it has some unique properties.
The Esperanto Vosk STT model has a WER of only 7.24 with this relatively small dataset, which is quite astonishing. Esperanto also works great on GPT-3 and, to some degree, on GPT-2. The language is part of many other public datasets as well.
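For context, WER is the word-level edit distance divided by the reference length, usually quoted as a percentage; a tiny sketch with the jiwer package, using made-up sentences:

```python
from jiwer import wer

# Word error rate = (substitutions + insertions + deletions) / reference words.
reference = "la suno brilas super la montoj"
hypothesis = "la suno brilas super la montej"  # one substituted word out of six
print(f"WER: {wer(reference, hypothesis):.2%}")  # 16.67%
```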
I tested Whisper with some Esperanto files. The English translation already works remarkably well, but the transcription doesn't, because the files are always classified as Latin.
Right now, Whisper supports only the languages defined in tokenizer.py, right?
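Both points can be checked quickly; a small sketch with the openai-whisper package, where the audio file name is a placeholder:

```python
import whisper
from whisper.tokenizer import LANGUAGES

# Esperanto has no entry in the supported-language table.
print("eo" in LANGUAGES)  # False

model = whisper.load_model("small")
audio = whisper.pad_or_trim(whisper.load_audio("esperanto_sample.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detection must pick one of the supported languages, so Esperanto
# input tends to land on a neighbor such as "la" (Latin).
_, probs = model.detect_language(mel)
print(max(probs, key=probs.get))
```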
I have two questions about this:
Best wishes,
Stefan