VITS fails to synthesize intelligible audio from Punjabi dataset using CISAMPA phoneme input #227

Hi! I'm training the VITS model on a Punjabi single-speaker dataset using phoneme-level transcriptions written in the Case Insensitive Speech Assessment Method Phonetic Alphabet (CISAMPA). Training runs without errors, but the synthesized audio is unintelligible: words are not recognizable, and phonemes appear to be mispronounced or skipped altogether. This does not happen when I train the model on the default LJSpeech dataset, which works as expected.

**My setup:**
- Dataset: 20 hours of Punjabi single-speaker data
- Transcriptions: CISAMPA (already phonemized)
- Cleaner: a custom cleaner that simply returns the input string:

def phoneme_cleaners(text):
    return text

- Config: cleaners=["phoneme_cleaners"]
- Phonemes are separated by forward slashes; words are separated by spaces.
- No character-level text is used.
- Other settings: defaults from ljs_base.json

My assumption is that, since the input is already phonemized in CISAMPA, no text normalization or further cleaning is required, so the cleaner is a pass-through. However, I'm unsure whether VITS expects phonemes in a specific format, or whether some preprocessing needs to be adapted for a non-English phoneme inventory.
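
To make the question concrete, here is a minimal sketch of the phoneme-level tokenization I think this input format calls for, as opposed to treating every character as its own symbol. It is not taken from the VITS code: the names `split_phonemes`, `build_symbol_table`, and `phonemes_to_sequence`, as well as the `PAD` symbol, are hypothetical, and the sample transcriptions are placeholders rather than real CISAMPA.

```python
from collections import Counter

PAD = "_"       # hypothetical padding symbol, reserved as id 0
WORD_SEP = " "  # word boundary, kept as its own token


def split_phonemes(line):
    """Split 's/a/t s/r/ii' into phoneme tokens, keeping word boundaries."""
    tokens = []
    for i, word in enumerate(line.strip().split()):
        if i > 0:
            tokens.append(WORD_SEP)
        # Drop empty strings caused by stray or doubled slashes.
        tokens.extend(p for p in word.split("/") if p)
    return tokens


def build_symbol_table(lines):
    """Collect every phoneme seen in the corpus and assign integer ids."""
    counts = Counter(tok for line in lines for tok in split_phonemes(line))
    phonemes = sorted(tok for tok in counts if tok != WORD_SEP)
    symbols = [PAD, WORD_SEP] + phonemes
    return {sym: idx for idx, sym in enumerate(symbols)}


def phonemes_to_sequence(line, symbol_to_id):
    """Map a transcription to ids; raises KeyError on an unknown phoneme."""
    return [symbol_to_id[tok] for tok in split_phonemes(line)]


# Placeholder transcriptions (not real CISAMPA), only to show the shapes.
corpus = ["s/a/t s/r/ii", "a/k/aa/l"]
table = build_symbol_table(corpus)
print(split_phonemes(corpus[0]))
print(phonemes_to_sequence(corpus[0], table))
```

With a mapping like this, a multi-character phoneme such as the placeholder `ii` becomes one id instead of two, and `/` never reaches the model as a symbol. Whether the VITS text frontend does something equivalent with my current pass-through cleaner is exactly what I'm unsure about.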

Any help or guidance on using custom phoneme inventories with VITS would be greatly appreciated. Thanks!
