Description
Hi! I'm training the VITS model on a Punjabi single-speaker dataset using phoneme-level transcriptions written in the Case Insensitive Speech Assessment Method Phonetic Alphabet (CISAMPA). The training runs without errors, but the synthesized audio is unintelligible: the words are not clear or meaningful, and phonemes seem to be mispronounced or skipped altogether. This does not happen when I train the model on the default LJSpeech dataset, which works as expected.
My setup:
Dataset: 20 hours of Punjabi single-speaker data
Transcriptions: In CISAMPA (already phonemized)
Cleaner: I’ve defined a custom cleaner that simply returns the input string:
def phoneme_cleaners(text):
    # Pass-through: the transcriptions are already phonemized in CISAMPA.
    return text
Config:
cleaners=["phoneme_cleaners"]
Phonemes are separated by forward slashes, and words are separated by spaces.
No character-level text is used.
Other configs: Default ljs_base.json
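To make the transcription format concrete, here is a minimal sketch of how I would expect one line to be split into phoneme tokens, assuming each slash-delimited unit (which may be more than one character) should map to a single symbol ID. The function names, the `<pad>` entry, and the sample string are made up for illustration; they are not part of VITS and the sample is not real CISAMPA data.

```python
WORD_BOUNDARY = " "  # assumed token emitted between words
PAD = "<pad>"        # hypothetical padding symbol, reserved as ID 0


def tokenize_phonemes(line):
    """Split text like 'k/aa/r g/a/r' into phoneme tokens plus word boundaries."""
    tokens = []
    for i, word in enumerate(line.strip().split()):
        if i > 0:
            tokens.append(WORD_BOUNDARY)
        # Skip empty pieces caused by leading, trailing, or doubled slashes.
        tokens.extend(p for p in word.split("/") if p)
    return tokens


def build_symbol_table(lines):
    """Assign one integer ID per distinct token found in the transcripts."""
    inventory = sorted({tok for line in lines for tok in tokenize_phonemes(line)})
    return {sym: idx for idx, sym in enumerate([PAD] + inventory)}


def to_sequence(line, symbol_to_id):
    """Map a transcript line to a list of symbol IDs, one per phoneme token."""
    return [symbol_to_id[tok] for tok in tokenize_phonemes(line)]


if __name__ == "__main__":
    sample = ["k/aa/r g/a/r"]  # made-up example line
    table = build_symbol_table(sample)
    print(tokenize_phonemes(sample[0]))
    print(to_sequence(sample[0], table))
```

The point of the sketch is that a multi-character phoneme such as `aa` stays one token. If the text front end instead walks the string character by character, `aa` would become two symbols and `/` would be treated as a symbol of its own.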
My assumption is that, since the input is already phonemized (in CISAMPA), no text normalization or further cleaning is required — so the cleaner is a pass-through. But I'm unsure if VITS expects phonemes in a specific format, or whether there's some preprocessing that I need to adapt for a non-English phoneme inventory.
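One check that might help narrow this down is whether every phoneme token in my transcripts is covered by the symbol set the model is trained with. The sketch below counts tokens and reports any that are missing. `known_symbols` is a placeholder for whatever symbol list the text front end actually uses; I have not confirmed how VITS builds or consumes that list, and the sample lines are invented.

```python
import re
from collections import Counter


def phoneme_counts(lines):
    """Count how often each slash- or space-delimited phoneme token occurs."""
    counts = Counter()
    for line in lines:
        counts.update(tok for tok in re.split(r"[/\s]+", line.strip()) if tok)
    return counts


def find_uncovered(lines, known_symbols):
    """Return {token: count} for tokens that are absent from known_symbols."""
    counts = phoneme_counts(lines)
    return {tok: n for tok, n in counts.items() if tok not in known_symbols}


if __name__ == "__main__":
    transcripts = ["k/aa/r g/a/r", "aa/g"]  # made-up example lines
    known_symbols = {"k", "a", "r", "g"}    # placeholder symbol set
    print(find_uncovered(transcripts, known_symbols))
```

A non-empty result would suggest the phoneme inventory and the model's symbol table disagree, which could be one explanation for skipped or mispronounced phonemes.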
Any help or guidance on using custom phoneme inventories with VITS would be greatly appreciated. Thanks!