-
This was an oversight in our data processing: we imported language labels from the VoxLingua107 dataset, which has three separate labels for the language. The WER scores on Fleurs for the three language labels range from 16% to 29%, and we're curious how usable Whisper is for Serbo-Croatian and what kinds of errors it makes.
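For readers unfamiliar with the metric: WER (word error rate) is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. Below is a minimal sketch of that calculation. The function name is illustrative, and it skips the text normalization (casing, punctuation) that a real evaluation would apply first, so it will not reproduce the Fleurs numbers above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

A WER of 0.16 to 0.29 therefore means that roughly one word in six to one word in three or four needs to be corrected.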
-
I did some informal testing of translation on YouTube videos, interviews, and movie clips, since I'm sometimes sent clips in Serbo-Croatian, which I don't really understand myself. I tested with --model small and --model medium and found that language auto-detection always picks Croatian. If I change the language to Serbian, the translation is word for word the same, except that "a bit" sometimes becomes "a little" and "between us" becomes "among us". So the translation did not get worse when I switched to a lower-resource language, as I had first expected.
Here's a translation with some differences:
Source: https://www.youtube.com/watch?v=B8agRjMdijM
Croatian:
Serbian:
There are differences, but they are rather minor. The Serbian one is perhaps better, and the original audio is also of the Serbian variety.
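If anyone wants to repeat this comparison on their own clips, the differences between two translation outputs can be listed automatically rather than by eye. This is a rough sketch using Python's standard difflib module. The function name is my own, and the two sentences are made up for illustration (they only reuse the "a bit"/"a little" and "between us"/"among us" swaps mentioned above); they are not actual Whisper output.

```python
import difflib


def word_level_changes(first: str, second: str) -> list[tuple[str, str]]:
    """Return (phrase_in_first, phrase_in_second) pairs where the two texts differ."""
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on one side means words were only inserted or deleted.
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


# Made-up example sentences, NOT real model output.
croatian_run = "Wait a bit, this stays between us."
serbian_run = "Wait a little, this stays among us."

for old, new in word_level_changes(croatian_run, serbian_run):
    print(f"{old!r} -> {new!r}")
```

An empty result means the two runs produced identical text, which is what I saw for most clips.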
-
I like this proposal. Please merge these three labels for better performance!
-
I've found that the language selection is not really important after all. I selected Norwegian for a Japanese audio file, and it was still translated correctly.
-
Any updates?
-
Very small question: I noticed that you went with three separate languages for Serbo-Croatian (Bosnian, Croatian, and Serbian).
In linguistic contexts (e.g. on Wiktionary), Serbo-Croatian, sometimes known as BCS (Bosnian/Croatian/Serbian), is not broken into its standard varieties, as they are all based on the same subdialect (Eastern Herzegovinian) of the same dialect (Shtokavian) of the same language (Serbo-Croatian).
Why did you split the language into Bosnian, Croatian, and Serbian? Dividing the training data across three labels seems to weaken the translation quality unnecessarily.
I would be interested in higher quality translations from Bosnian/Croatian/Serbian speech into English.