Replies: 1 comment
-
The same applies to Dhivehi as well. The Whisper normalizer removes all vowel diacritics and makes the text unreadable.
-
The basic normalizer seems to replace diacritics with whitespace for a collection of languages that includes most Indic languages. For example, in Hindi the text
becomes
and in Malayalam the text
becomes
This makes the measured WER for these languages more akin to CER, despite words being clearly delimited by spaces. The culprit seems to be the `remove_symbols` function, which is used when `remove_diacritics` is set to false. Its line `" " if unicodedata.category(c)[0] in "MSP"` replaces every character whose Unicode general category starts with M (spacing, nonspacing, and enclosing marks), S (symbols), or P (punctuation) with whitespace. Spacing and nonspacing marks include the diacritics of a laundry list of languages, so simply removing the `M` from `MSP` seems to give better normalization for those languages.

Rerunning evaluation for the medium model on the FLEURS Tamil test set with the fixed normalization gave me a WER of 56.75, in comparison to the 23.1 cited in the paper. I suspect similar differences in WER will be present for most of these languages.