TTS and STS Models to port to MLX-Audio (Roadmap) #1

@Blaizzy

Overview

This issue tracks the roadmap for porting additional text-to-speech (TTS), speech-to-speech (STS), and speech-to-text (STT) models to the MLX-Audio library, expanding it beyond the Kokoro model it currently supports.

Text-to-Speech (TTS) Models

Planned TTS Models

  • Nari Labs Dia 1.6B
  • OuteTTS v1
  • Orpheus
  • BARK
  • SparkTTS 0.5B
  • Sesame CSM-1B
  • IndexTTS
  • ChatterBox
  • VibeVoice
  • VyoTTS
  • MegaTTS
  • Zonos
  • CosyVoice2
  • StyleTTS2
  • Parler TTS
  • ibm-granite/granite-speech-3.2-8b
  • LLMVoX
  • MeloTTS
  • bosonai/higgs-audio-v2

Speech-to-Speech (STS) Models

Planned STS Models

  • Kyutai-Labs Moshi
  • Kyutai-Labs Moshi-vis

Speech-to-Text (STT) Models

  • Whisper
  • Parakeet
  • Wav2vec
  • Voxtral
  • Canary

Technical Considerations

  • All models will need MLX-specific optimizations
  • Quantization support should be implemented for each model
  • Documentation and examples will be created for each new model
  • Performance benchmarks will be established
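As a starting point for the benchmarking item above, here is a minimal sketch of how a port's speed could be measured as a real-time factor (generation time divided by the duration of the audio produced). The names `benchmark_tts`, `BenchmarkResult`, and the `generate` callable are hypothetical and are not part of the MLX-Audio API; the sketch assumes a model can be wrapped in a function that takes text and returns a sequence of audio samples.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class BenchmarkResult:
    wall_seconds: float      # fastest observed generation time
    audio_seconds: float     # duration of the audio produced in that run
    real_time_factor: float  # wall_seconds / audio_seconds (below 1.0 is faster than real time)


def benchmark_tts(
    generate: Callable[[str], Sequence[float]],  # hypothetical wrapper around a model
    text: str,
    sample_rate: int,
    warmup: int = 1,
    runs: int = 3,
) -> BenchmarkResult:
    """Time `generate(text)` and report the best run as a real-time factor.

    Note: MLX computes arrays lazily, so a wrapper around an MLX model should
    force evaluation (for example with `mx.eval`) before returning, otherwise
    the timing will not include the actual computation.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")

    # Warm-up runs are discarded so one-time setup costs do not skew the result.
    for _ in range(warmup):
        generate(text)

    best_wall = float("inf")
    best_samples = 0
    for _ in range(runs):
        start = time.perf_counter()
        samples = generate(text)
        wall = time.perf_counter() - start
        if wall < best_wall:
            best_wall = wall
            best_samples = len(samples)

    audio_seconds = best_samples / sample_rate
    if audio_seconds == 0:
        raise ValueError("generate() returned no audio samples")
    return BenchmarkResult(best_wall, audio_seconds, best_wall / audio_seconds)


if __name__ == "__main__":
    # Stand-in generator: pretends to synthesize 2 seconds of 24 kHz audio.
    def fake_generate(text: str) -> Sequence[float]:
        time.sleep(0.01)
        return [0.0] * (24_000 * 2)

    result = benchmark_tts(fake_generate, "Hello from MLX-Audio", sample_rate=24_000)
    print(f"real-time factor: {result.real_time_factor:.4f}")
```

Reporting the best of several runs after a warm-up is one common choice; a contributor might prefer the median, and could run the same harness on quantized and unquantized weights to compare them.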

Instructions

  1. Select a model and comment below with your choice.
  2. Create a Draft PR titled: "Add support for X".
  3. Read the contribution guide.
  4. Check the existing models.
  5. Tag @Blaizzy for code reviews and questions.

Community Input

We welcome community feedback on prioritization and additional model suggestions. Please comment on this issue with your thoughts.
