VITS fails to synthesize intelligible audio from Punjabi dataset using CISAMPA phoneme input #227

@Fatima-Naseem071

Hi! I'm training the VITS model on a Punjabi single-speaker dataset using phoneme-level transcriptions written in the Case Insensitive Speech Assessment Method Phonetic Alphabet (CISAMPA). The training runs without errors, but the synthesized audio is unintelligible: the words do not sound clear or meaningful, and phonemes seem to be mispronounced or skipped altogether. This does not happen when I train the model on the default LJSpeech dataset, which works as expected.

My setup:
Dataset: 20 hours of Punjabi single-speaker data
Transcriptions: In CISAMPA (already phonemized)
Cleaner: I’ve defined a custom cleaner that simply returns the input string:

def phoneme_cleaners(text):
    # Pass-through: the input is already phonemized, so nothing is changed.
    return text

Config:
cleaners=["phoneme_cleaners"]
Phonemes are separated by forward slashes, and words are separated by spaces.
No character-level text is used.

Other configs: Default ljs_base.json
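
To make the transcription format concrete, here is a minimal sketch of how a slash-separated, space-delimited line could be split into phoneme tokens, compared with reading the same line one character at a time. The phoneme labels are made-up placeholders rather than real CISAMPA transcriptions, and `tokenize_phonemes` and `WORD_BOUNDARY` are hypothetical names for illustration, not part of VITS.

```python
# Hypothetical word-boundary token; any symbol that cannot be a phoneme label works.
WORD_BOUNDARY = " "


def tokenize_phonemes(line):
    """Split 'P1/P2/P3 P4/P5' into phoneme tokens with a boundary token between words."""
    tokens = []
    for index, word in enumerate(line.split()):
        if index > 0:
            tokens.append(WORD_BOUNDARY)
        # Drop empty strings produced by leading, trailing, or doubled slashes.
        tokens.extend(phoneme for phoneme in word.split("/") if phoneme)
    return tokens


# Placeholder labels only; these are not real CISAMPA transcriptions.
example = "T_D/A/R P/AA/N/I"

print(tokenize_phonemes(example))
# ['T_D', 'A', 'R', ' ', 'P', 'AA', 'N', 'I']

# Reading the string character by character breaks multi-character labels
# apart and treats every '/' as a symbol of its own.
print(list("P/AA"))
# ['P', '/', 'A', 'A']
```

If the text frontend in use iterates over characters instead of phoneme tokens, a multi-character label such as the placeholder `AA` would reach the model as two separate symbols, which may be worth checking in this setup.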

My assumption is that, since the input is already phonemized (in CISAMPA), no text normalization or further cleaning is required, so the cleaner is a pass-through. But I'm unsure whether VITS expects phonemes in a specific format, or whether there's some preprocessing that I need to adapt for a non-English phoneme inventory.
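
One way to test this assumption is to derive the phoneme inventory from the transcripts and confirm that every token maps to an ID. The sketch below assumes the slash and space format described above. All names in it (`PAD`, `split_line`, `build_symbol_table`, `to_ids`) are hypothetical and are not VITS functions, and the phoneme labels are placeholders rather than real CISAMPA.

```python
PAD = "<pad>"          # hypothetical padding symbol, reserved at index 0
WORD_BOUNDARY = " "    # hypothetical token marking the gap between words


def split_line(line):
    """Turn 'P1/P2 P3/P4' into ['P1', 'P2', ' ', 'P3', 'P4']."""
    tokens = []
    for index, word in enumerate(line.split()):
        if index:
            tokens.append(WORD_BOUNDARY)
        tokens.extend(p for p in word.split("/") if p)
    return tokens


def build_symbol_table(lines):
    """Collect every phoneme token seen in the transcripts and assign stable IDs."""
    inventory = {tok for line in lines for tok in split_line(line)}
    inventory.discard(WORD_BOUNDARY)
    symbols = [PAD, WORD_BOUNDARY] + sorted(inventory)
    return {symbol: idx for idx, symbol in enumerate(symbols)}


def to_ids(line, symbol_to_id):
    """Map a transcript line to IDs, failing loudly on tokens outside the table."""
    tokens = split_line(line)
    unknown = sorted({tok for tok in tokens if tok not in symbol_to_id})
    if unknown:
        raise KeyError(f"tokens missing from symbol table: {unknown}")
    return [symbol_to_id[tok] for tok in tokens]


# Placeholder transcripts; not real CISAMPA.
corpus = ["T_D/A/R P/AA/N/I", "N/A/M"]
table = build_symbol_table(corpus)
print(len(table))                  # 10 symbols, including PAD and the word boundary
print(to_ids("P/AA/N/I", table))   # [7, 3, 6, 4]
```

Raising an error on unknown tokens, rather than dropping them silently, makes it easier to see whether any phonemes are being lost before they reach the model.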

Any help or guidance on using custom phoneme inventories with VITS would be greatly appreciated. Thanks!
