<b>Bolbosh</b> is a multi-speaker text-to-speech (TTS) system for the Kashmiri language (Persio-Arabic script). Adapting the Matcha-TTS framework—a non-autoregressive, conditional flow matching (CFM) based model—Bolbosh synthesizes natural-sounding Kashmiri speech from text. To the best of our knowledge, this is among the first neural TTS systems for Kashmiri, a low-resource language spoken in the Kashmir region.
119
-
</p>
120
-
<p>
121
-
Our system is trained on 424 speakers by combining the Rasa dataset and IndicVoices corpus. We employ transfer learning from a pre-trained English multi-speaker model (VCTK) and utilize character-level text processing with a custom Kashmiri text normalizer—eliminating the need for a phonemizer. Using an ODE-based inference procedure, Bolbosh enables fast, high-quality synthesis in as few as 10 steps.
118
+
Kashmiri is spoken by around 7 million people but remains critically underserved in speech technology, despite its official status and rich linguistic heritage. The lack of robust Text-to-Speech (TTS) systems limits digital accessibility and inclusive human-computer interaction for native speakers. In this work, we present the first dedicated, open-source neural TTS system designed for Kashmiri. We show that zero-shot multilingual baselines trained for Indic languages fail to produce intelligible speech, achieving a Mean Opinion Score (MOS) of only 1.86, largely due to inadequate modeling of Perso-Arabic diacritics and language-specific phonotactics.
119
+
To address these limitations, we propose <b>Bolbosh</b>, a supervised cross-lingual adaptation strategy based on Optimal Transport Conditional Flow Matching (OT-CFM) within the Matcha-TTS framework. This enables stable alignment under limited paired data. We further introduce a three-stage acoustic enhancement pipeline consisting of dereverberation, silence trimming, and loudness normalization to unify heterogeneous speech sources and stabilize alignment learning. The model's vocabulary is expanded to explicitly encode Kashmiri graphemes, preserving fine-grained vowel distinctions.
120
+
Our system achieves a MOS of 3.63 and a Mel-Cepstral Distortion (MCD) of 3.73, substantially outperforming multilingual baselines and establishing a new benchmark for Kashmiri speech synthesis. Our results demonstrate that script-aware and supervised flow-based adaptation are critical for low-resource TTS in diacritic-sensitive languages.
These audio samples were synthesized by <b>Bolbosh</b> using our two high-quality Rasa speakers. Generation was performed using a Conditional Flow Matching (CFM) decoder in 10 ODE steps and synthesized into waveforms via a HiFi-GAN vocoder.
133
+
These audio samples were synthesized by <b>Bolbosh</b> using our two high-quality Rasa speakers. Generation was performed using a Conditional Flow Matching (CFM) decoder and synthesized into waveforms via a HiFi-GAN vocoder.
title={Bolbosh: A Multi-Speaker Text-to-Speech System for Kashmiri},
454
+
<pre><code>@inproceedings{ashraf2026bolbosh,
455
+
title={Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech},
457
456
author={Ashraf, Tajamul and Zargar, Burhaan Rasheed and Muizz, Saeed Abdul and Mushtaq, Ifrah and Mehdi, Nazima and Gillani, Iqra Altaf and Kak, Aadil Amin and Bashir, Janibul},