This repo is a personal laboratory for training autoregressive text-audio models.
Assume everything will change; quality right now is pretty mid, but it will get better.
A distillation of Kokoro TTS to the RQ Transformer architecture, released at 70M and 150M parameter scales.
For MLX inference on Apple Silicon, you'll need a working Python installation. See the `mlx_inference` folder for setup docs!
# tl;dr
```sh
uvx --from smoltts_mlx smoltts-server
```
Candle.rs docs coming soon.
As of Feb 2025, this project uses the Mimi pretrained codec by Kyutai, due to its low framerate (12.5 Hz), high compression ratio, and streaming support.
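As a rough illustration of what Mimi gives us, here is a minimal sketch of encoding a waveform into Mimi codes via the transformers port of the codec. The `kyutai/mimi` checkpoint, the 24 kHz mono input, and the zero waveform are assumptions for the example, not necessarily how this repo's pipeline invokes the codec:

```python
# Minimal sketch: encode audio into Mimi codes with the transformers port of Mimi.
# The kyutai/mimi checkpoint and 24 kHz mono input are assumptions for illustration.
import torch
from transformers import AutoFeatureExtractor, MimiModel

model = MimiModel.from_pretrained("kyutai/mimi")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# One second of silence as a stand-in waveform (24 kHz mono).
waveform = torch.zeros(24_000).numpy()
inputs = feature_extractor(raw_audio=waveform, sampling_rate=24_000, return_tensors="pt")

with torch.no_grad():
    codes = model.encode(inputs["input_values"]).audio_codes  # (batch, codebooks, frames)
print(codes.shape)  # ~12-13 frames for one second of audio at 12.5 Hz
```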
projectgutenberg-kokoro_v1-mimi:
- ~5500 hours of synthetic audio generated with Kokoro v1 for US and UK English.
- 3 million utterances of sentences from Project Gutenberg, mostly 3-15s. 3.29GB compressed with Mimi.
- 11 speakers.
For convenience, we also serialize popular open TTS benchmark datasets with Mimi, so they can be used directly as training targets while shrinking the file size by ~500x:
- LibriTTS-R encoded with Mimi codec. ~460 hours of data.
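As a sketch of how a pre-encoded split might be consumed downstream (the dataset ID and column name below are hypothetical placeholders, not the actual repo names):

```python
# Minimal sketch: pull a Mimi-encoded dataset from the Hugging Face Hub.
# "your-org/libritts-r-mimi" and the "codes" column are hypothetical placeholders.
from datasets import load_dataset

ds = load_dataset("your-org/libritts-r-mimi", split="train")
example = ds[0]
codes = example["codes"]  # Mimi token IDs; no raw audio decoding (and no librosa) required
print(len(codes))
```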
Unfortunately, HuggingFace Datasets with audio columns require librosa, which has a hard Python 3.9 dependency for inexplicable reasons. If you are not creating a new dataset from raw audio (rather than Mimi codes), feel free to ignore this.
Please use uv.
```sh
# If you are not making new audio datasets, feel free to use a sane Python version instead
uv sync
uv pip install -e .
```
Create a `.env` file and add:

```
HUGGINGFACE_TOKEN=sk-placeholder
```
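As a sketch of how that token might be picked up by a script (assuming python-dotenv is installed; the repo's own entry points may read the variable differently):

```python
# Minimal sketch: load HUGGINGFACE_TOKEN from .env and log in to the Hub.
# Assumes python-dotenv is available; the actual scripts may handle this themselves.
import os

from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv()  # reads .env from the current working directory
login(token=os.environ["HUGGINGFACE_TOKEN"])
```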
For the dataset and init, see `data_pipeline/README.md`.
This architecture is most popularly used as the neural-codec seq2seq backbone for:
- Fish Speech TTS (described in their paper as "DualAR", i.e. dual-autoregressive)
- Kyutai's Moshi model, early in pretraining, before its adaptation to duplex audio.
Models trained here will be compatible with my DualAR fish-speech.rs inference engine.
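For intuition, here is a heavily simplified sketch of the dual-autoregressive idea: a slow transformer runs once per Mimi frame, and a small fast transformer runs over the codebooks inside each frame. All module names, sizes, and the use of plain PyTorch are illustrative assumptions, not this repo's actual model code:

```python
# Illustrative sketch of a dual-autoregressive ("DualAR" / RQ Transformer) backbone.
# Hyperparameters, module choices, and shapes are assumptions for clarity only.
import torch
import torch.nn as nn


class DualARSketch(nn.Module):
    def __init__(self, vocab_size=2048, d_model=256):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Slow transformer: one position per audio frame (text conditioning omitted).
        slow_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.slow = nn.TransformerEncoder(slow_layer, num_layers=4)
        # Fast transformer: one position per codebook within a frame.
        fast_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fast = nn.TransformerEncoder(fast_layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, codes):
        # codes: (batch, frames, n_codebooks) integer Mimi codes, teacher-forced
        B, T, C = codes.shape
        emb = self.token_embed(codes)                              # (B, T, C, d)
        # Slow pass: one causal step per frame, summarizing its codebooks.
        slow_h = self.slow(emb.mean(dim=2),
                           mask=nn.Transformer.generate_square_subsequent_mask(T))
        # Fast pass: within each frame, predict codebook c from the frame's slow
        # hidden state plus the embeddings of codebooks < c.
        prefix = slow_h.reshape(B * T, 1, -1)
        fast_in = torch.cat([prefix, emb.reshape(B * T, C, -1)[:, :-1]], dim=1)
        fast_h = self.fast(fast_in,
                           mask=nn.Transformer.generate_square_subsequent_mask(C))
        return self.head(fast_h).reshape(B, T, C, -1)              # per-codebook logits


# Toy usage: batch of 2, 10 frames, 8 codebooks.
model = DualARSketch()
print(model(torch.randint(0, 2048, (2, 10, 8))).shape)  # torch.Size([2, 10, 8, 2048])
```

At generation time the slow transformer advances one frame at a time while the fast transformer samples that frame's codebooks sequentially; that sampling loop is omitted here for brevity.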