Normalization replaces diacritics with whitespace #858

thennal10 · 2023-01-17T09:24:21Z

The basic normalizer seems to replace diacritics with whitespace for a collection of languages that includes most Indic languages. For example, in Hindi the text

प्राचीन संस्कृतियों और जनजातियों ने दूध बाल मांस और चमड़े को अपनी आसान पहुंच में बनाए रखना शुरू कर दिया

becomes

प र च न स स क त य और जनज त य न द ध ब ल म स और चमड क अपन आस न पह च म बन ए रखन श र कर द य

and in Malayalam the text

സ്ഥിതിഗതികൾ നേരെയാക്കുന്നതിന് പാർലമെൻ്ററി നിയമ നിർമ്മാണത്തിൻ്റെ അത്യാവശ്യം ഐറിഷ് സർക്കാർ ഊന്നി പറയുന്നു

becomes

സ ഥ ത ഗത കൾ ന ര യ ക ക ന നത ന പ ർലമ ൻ ററ ന യമ ന ർമ മ ണത ത ൻ റ അത യ വശ യ ഐറ ഷ സർക ക ർ ഊന ന പറയ ന ന

This makes the measured WER for these languages more akin to CER, since each word is split into fragments wherever a diacritic was, despite words being clearly delimited by spaces. The culprit seems to be the following function:

import unicodedata

def remove_symbols(s: str):
    """
    Replace any other markers, symbols, punctuations with a space, keeping diacritics
    """
    return "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c for c in unicodedata.normalize("NFKC", s)
    )

remove_symbols is used when remove_diacritics is set to false. In it, the expression " " if unicodedata.category(c)[0] in "MSP" replaces every character in Unicode category M (spacing, nonspacing, and enclosing marks) with whitespace. Spacing and nonspacing marks include the dependent vowel signs and other diacritics of many scripts, so simply removing the M from "MSP" seems to give better normalization for those languages.
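The change suggested above can be sketched as follows. This is a minimal illustration, not the library's code: `remove_symbols_original` reproduces the quoted behaviour, and `remove_symbols_keep_marks` is a hypothetical name for a variant whose only difference is that the category test uses "SP" instead of "MSP".

```python
import unicodedata


def remove_symbols_original(s: str) -> str:
    # Behaviour quoted above: marks (M*), symbols (S*) and punctuation (P*) become spaces.
    return "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c
        for c in unicodedata.normalize("NFKC", s)
    )


def remove_symbols_keep_marks(s: str) -> str:
    # Hypothetical variant: only symbols and punctuation become spaces; marks are kept.
    return "".join(
        " " if unicodedata.category(c)[0] in "SP" else c
        for c in unicodedata.normalize("NFKC", s)
    )


# "दूध" (milk) written with escapes: DA, VOWEL SIGN UU (category Mn), DHA
word = "\u0926\u0942\u0927"

print([unicodedata.category(c) for c in word])  # ['Lo', 'Mn', 'Lo']
print(remove_symbols_original(word))            # the vowel sign is replaced by a space
print(remove_symbols_keep_marks(word))          # the word is left intact
```

Keeping all of category M also keeps enclosing marks and characters such as variation selectors, which may or may not be desirable for a given evaluation.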

Rerunning the evaluation for the medium model on the FLEURS Tamil test set with the M removed from the normalizer gave me a WER of 56.75, compared with the 23.1 cited in the paper. I suspect similar differences in WER will be present for most of these languages.
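A toy example may help show why the choice of normalizer moves WER in this direction. It does not reproduce the numbers above; `normalize` and `wer` are hypothetical helpers written for this sketch, and the hypothesis string is made up. When marks are replaced by spaces, one wrong word is scored as one wrong fragment out of several, so the error rate is diluted.

```python
import unicodedata


def normalize(s: str, drop: str) -> str:
    # Replace every character whose major Unicode category is in `drop` with a space.
    return "".join(
        " " if unicodedata.category(c)[0] in drop else c
        for c in unicodedata.normalize("NFKC", s)
    )


def wer(ref: str, hyp: str) -> float:
    # Word-level Levenshtein distance divided by the number of reference words.
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        cur = [i]
        for j, hw in enumerate(h, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (rw != hw)))
        prev = cur
    return prev[-1] / len(r)


ref = "\u0926\u0942\u0927"  # "दूध"
hyp = "\u0926\u0942\u0926"  # made-up hypothesis: same word with the last consonant wrong

wer_marks_dropped = wer(normalize(ref, "MSP"), normalize(hyp, "MSP"))
wer_marks_kept = wer(normalize(ref, "SP"), normalize(hyp, "SP"))
print(wer_marks_dropped, wer_marks_kept)  # 0.5 1.0
```

With marks dropped, the single word becomes two fragments and only one of them is wrong; with marks kept, the whole word counts as one error.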

dash8x · 2023-07-26T11:47:38Z

The same applies to Dhivehi as well. The Whisper normalizer removes all vowel diacritics and makes the text unreadable.
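As a quick check of the Dhivehi case, the sketch below inspects one Thaana word. It assumes the same replacement rule as the `remove_symbols` function quoted earlier in the thread and inlines it, so it is an illustration and not the library's code.

```python
import unicodedata

# "ދިވެހި" (Dhivehi) written with escapes: three consonants, each followed by a vowel sign
word = "\u078b\u07a8\u0788\u07ac\u0780\u07a8"

categories = [unicodedata.category(c) for c in word]
print(categories)  # ['Lo', 'Mn', 'Lo', 'Mn', 'Lo', 'Mn']

# Same rule as the quoted remove_symbols: M, S and P characters become spaces
normalized = "".join(
    " " if unicodedata.category(c)[0] in "MSP" else c
    for c in unicodedata.normalize("NFKC", word)
)
print(normalized.split())  # only the three bare consonants remain
```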
