vllm-mlx supports audio processing using mlx-audio, providing:
- STT (Speech-to-Text): Whisper, Parakeet
- TTS (Text-to-Speech): Kokoro, Chatterbox, VibeVoice, VoxCPM
- Audio Processing: SAM-Audio (voice separation)
# Core audio support
pip install mlx-audio>=0.2.9
# Required dependencies for TTS
pip install sounddevice soundfile scipy numba tiktoken misaki spacy num2words loguru phonemizer
# Download spacy English model
python -m spacy download en_core_web_sm
# For non-English TTS (Spanish, French, etc.), install espeak-ng:
# macOS
brew install espeak-ng
# Ubuntu/Debian
# sudo apt-get install espeak-ngOr install all audio dependencies at once:
pip install vllm-mlx[audio]
python -m spacy download en_core_web_sm
brew install espeak-ng # macOS, for non-English languagesfrom openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
# Transcribe audio file
with open("audio.mp3", "rb") as f:
transcript = client.audio.transcriptions.create(
model="whisper-large-v3",
file=f,
language="en" # optional
)
print(transcript.text)# Generate speech
audio = client.audio.speech.create(
model="kokoro",
input="Hello, how are you?",
voice="af_heart",
speed=1.0
)
# Save to file
with open("output.wav", "wb") as f:
f.write(audio.content)Isolate voice from background noise, music, or other sounds:
from vllm_mlx.audio import AudioProcessor
# Load SAM-Audio model
processor = AudioProcessor("mlx-community/sam-audio-large-fp16")
processor.load()
# Separate speech from audio
result = processor.separate("meeting_with_music.mp3", description="speech")
# Save isolated voice and background
processor.save(result.target, "voice_only.wav")
processor.save(result.residual, "background_only.wav")CLI Example:
python examples/audio_separation_example.py meeting.mp3 --play
python examples/audio_separation_example.py song.mp3 --description music -o music.wavIsolate drums from a rock song using SAM-Audio:
| Audio | Description | Listen |
|---|---|---|
| Original | "Get Ready" by David Fesliyan (30s, royalty-free) | 🎵 rock_get_ready.mp3 |
| Isolated Drums | Drums extracted by SAM-Audio | 🥁 drums_isolated.wav |
| Without Drums | Track with drums removed | 🎸 rock_no_drums.wav |
# Isolate drums from rock song
python examples/audio_separation_example.py examples/rock_get_ready.mp3 \
--description "drums" \
--output drums_isolated.wav \
--background rock_no_drums.wavPerformance: 30s audio processed in ~20 seconds on M4 Max.
| Model | Alias | Languages | Speed | Quality |
|---|---|---|---|---|
mlx-community/whisper-large-v3-mlx |
whisper-large-v3 |
99+ | Medium | Best |
mlx-community/whisper-large-v3-turbo |
whisper-large-v3-turbo |
99+ | Fast | Great |
mlx-community/whisper-medium-mlx |
whisper-medium |
99+ | Fast | Good |
mlx-community/whisper-small-mlx |
whisper-small |
99+ | Very Fast | OK |
mlx-community/parakeet-tdt-0.6b-v2 |
parakeet |
English | Fastest | Great |
mlx-community/parakeet-tdt-0.6b-v3 |
parakeet-v3 |
English | Fastest | Best |
Recommendation:
- Multilingual:
whisper-large-v3 - English only:
parakeet(3x faster)
| Model | Alias | Size | Languages |
|---|---|---|---|
mlx-community/Kokoro-82M-bf16 |
kokoro |
82M | EN, ES, FR, JA, ZH, HI, IT, PT |
mlx-community/Kokoro-82M-4bit |
kokoro-4bit |
82M | EN, ES, FR, JA, ZH, HI, IT, PT |
Voices (11):
- Female American:
af_heart,af_bella,af_nicole,af_sarah,af_sky - Male American:
am_adam,am_michael - Female British:
bf_emma,bf_isabella - Male British:
bm_george,bm_lewis
Language Codes:
| Code | Language | Code | Language |
|---|---|---|---|
a / en |
English (US) | e / es |
Español |
b / en-gb |
English (UK) | f / fr |
Français |
j / ja |
日本語 | z / zh |
中文 |
i / it |
Italiano | p / pt |
Português |
h / hi |
हिन्दी |
| Model | Alias | Size | Languages |
|---|---|---|---|
mlx-community/chatterbox-turbo-fp16 |
chatterbox |
134M | 15+ languages |
mlx-community/chatterbox-turbo-4bit |
chatterbox-4bit |
134M | 15+ languages |
Supported Languages: EN, ES, FR, DE, IT, PT, RU, JA, ZH, KO, AR, HI, NL, PL, TR
| Model | Alias | Size | Use Case |
|---|---|---|---|
mlx-community/VibeVoice-Realtime-0.5B-4bit |
vibevoice |
200M | Low latency, English |
| Model | Alias | Size | Languages |
|---|---|---|---|
mlx-community/VoxCPM1.5 |
voxcpm |
0.9B | ZH, EN |
mlx-community/VoxCPM1.5-4bit |
voxcpm-4bit |
200M | ZH, EN |
| Model | Size | Use Case |
|---|---|---|
mlx-community/sam-audio-large-fp16 |
3B | Best quality |
mlx-community/sam-audio-large |
3B | Standard |
mlx-community/sam-audio-small-fp16 |
0.6B | Fast |
mlx-community/sam-audio-small |
0.6B | Lightweight |
Transcribe audio to text (OpenAI Whisper API compatible).
Parameters:
file: Audio file (mp3, wav, m4a, webm)model: Model name or aliaslanguage: Language code (optional, auto-detected)response_format:jsonortext
Example:
curl http://localhost:8000/v1/audio/transcriptions \
-F file=@audio.mp3 \
-F model=whisper-large-v3Generate speech from text (OpenAI TTS API compatible).
Parameters:
model: Model name or aliasinput: Text to synthesizevoice: Voice IDspeed: Speech speed (0.5 to 2.0)response_format:wav,mp3
Example:
curl http://localhost:8000/v1/audio/speech \
-d '{"model": "kokoro", "input": "Hello world", "voice": "af_heart"}' \
-H "Content-Type: application/json" \
--output speech.wavList available voices for a model.
Example:
curl http://localhost:8000/v1/audio/voices?model=kokoroReal-time speech-to-text transcription from your microphone:
# Closed captions with whisper-large-v3 (best quality)
python examples/closed_captions.py --language es --chunk 5
# Faster model for lower latency
python examples/closed_captions.py --language en --model whisper-turbo --chunk 3
# Basic mic transcription (record then transcribe)
python examples/mic_transcribe.py --language es
# Real-time chunked transcription
python examples/mic_realtime.py --language es --chunk 3
# Live transcription with voice activity detection
python examples/mic_live.py --language esRequirements:
pip install sounddevice soundfile numpy# Simple TTS example
python examples/tts_example.py "Hello, how are you?" --play
# With different voice
python examples/tts_example.py "Hello!" --voice am_michael --play
# Save to file
python examples/tts_example.py "Welcome to the demo" -o greeting.wav
# List available voices
python examples/tts_example.py --list-voices# English (auto-selects best model)
python examples/tts_multilingual.py "Hello world" --play
# Spanish
python examples/tts_multilingual.py "Hola mundo" --lang es --play
# French
python examples/tts_multilingual.py "Bonjour le monde" --lang fr --play
# Japanese
python examples/tts_multilingual.py "こんにちは" --lang ja --play
# Chinese
python examples/tts_multilingual.py "你好世界" --lang zh --play
# Use specific model
python examples/tts_multilingual.py "Hello" --model chatterbox --play
# List all models
python examples/tts_multilingual.py --list-models
# List all languages
python examples/tts_multilingual.py --list-languagesPre-generated voice samples with native voices for common business use cases:
| Language | Voice | Message | Listen |
|---|---|---|---|
| 🇺🇸 English | af_heart | "Welcome to First National Bank. How may I assist you today?" | |
| 🇪🇸 Spanish | ef_dora | "Gracias por llamar a servicio al cliente. Un agente le atenderá pronto." | |
| 🇫🇷 French | ff_siwis | "Bienvenue. Votre appel est important pour nous." | |
| 🇨🇳 Chinese | zf_xiaobei | "欢迎致电技术支持中心。我们将竭诚为您服务。" |
Generate your own with native voices:
# English - Bank assistant (native voice: af_heart)
python -m mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 \
--text "Welcome to First National Bank. How may I assist you today?" \
--voice af_heart --lang_code a --file_prefix assistant_bank_en
# Spanish - Customer service (native voice: ef_dora)
python -m mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 \
--text "Gracias por llamar a servicio al cliente. Un agente le atendera pronto." \
--voice ef_dora --lang_code e --file_prefix assistant_service_es
# French - Call center (native voice: ff_siwis)
python -m mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 \
--text "Bienvenue. Votre appel est important pour nous." \
--voice ff_siwis --lang_code f --file_prefix assistant_callcenter_fr
# Chinese - Tech support (native voice: zf_xiaobei)
python -m mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 \
--text "欢迎致电技术支持中心。我们将竭诚为您服务。" \
--voice zf_xiaobei --lang_code z --file_prefix assistant_support_zh| Language | Code | Voices |
|---|---|---|
| English (US) | a |
af_heart, af_bella, af_nicole, am_adam, am_michael |
| English (UK) | b |
bf_emma, bf_isabella, bm_george, bm_lewis |
| Spanish | e |
ef_dora, em_alex, em_santa |
| French | f |
ff_siwis |
| Chinese | z |
zf_xiaobei, zf_xiaoni, zf_xiaoxiao, zm_yunjian, zm_yunxi |
| Japanese | j |
jf_alpha, jf_gongitsune, jm_kumo |
| Italian | i |
if_sara, im_nicola |
| Portuguese | p |
pf_dora, pm_alex |
| Hindi | h |
hf_alpha, hf_beta, hm_omega |
from vllm_mlx.audio import STTEngine, TTSEngine, AudioProcessor
# Speech-to-Text
stt = STTEngine("mlx-community/whisper-large-v3-mlx")
stt.load()
result = stt.transcribe("audio.mp3")
print(result.text)
# Text-to-Speech
tts = TTSEngine("mlx-community/Kokoro-82M-bf16")
tts.load()
audio = tts.generate("Hello world", voice="af_heart")
tts.save(audio, "output.wav")
# Voice Separation
processor = AudioProcessor("mlx-community/sam-audio-large-fp16")
processor.load()
result = processor.separate("mixed_audio.mp3", description="speech")
processor.save(result.target, "voice_only.wav")
processor.save(result.residual, "background.wav")from vllm_mlx.audio import transcribe_audio, generate_speech, separate_voice
# Quick transcription
result = transcribe_audio("audio.mp3")
print(result.text)
# Quick TTS
audio = generate_speech("Hello world", voice="af_heart")
# Quick voice separation
voice, background = separate_voice("mixed.mp3")Include audio in chat messages (transcribed automatically):
response = client.chat.completions.create(
model="default",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Summarize this audio"},
{"type": "audio_url", "audio_url": {"url": "file://meeting.mp3"}}
]
}]
)Tested on Apple M2 Max (32GB).
| Text Length | Audio Duration | Gen Time | RTF | Chars/sec |
|---|---|---|---|---|
| 25 chars | 1.95s | 0.43s | 4.6x | 58.5 |
| 88 chars | 6.00s | 0.32s | 18.6x | 272.4 |
| 117 chars | 7.92s | 0.27s | 29.0x | 427.4 |
Summary:
- Model load time: ~1.0s
- Average RTF: 17.4x (17x faster than real-time)
- Average chars/sec: 252.8
| Model | Load Time | Transcribe (6s audio) | RTF |
|---|---|---|---|
| whisper-small | 0.25s | 0.20s | 30.2x |
| whisper-medium | 18.1s | 0.38s | 15.5x |
| whisper-large-v3 | ~30s | ~0.6s | ~10x |
| parakeet | ~0.5s | ~0.15s | ~40x |
Notes:
- RTF (Real-Time Factor) indicates how many times faster than real-time
- First load includes model download from HuggingFace
- Subsequent loads use cached models
| Use Case | Recommended Model | Why |
|---|---|---|
| Fast English STT | parakeet |
40x RTF, low memory |
| Multilingual STT | whisper-large-v3 |
99+ languages |
| Low-latency STT | whisper-small |
30x RTF, quick load |
| General TTS | kokoro |
17x RTF, good quality |
| Low memory TTS | kokoro-4bit |
4-bit quantized |
- Use Parakeet for English - 40x faster than real-time
- Use 4-bit models for lower memory usage
- Use SAM-Audio small for faster voice separation
- Cache models - engines are lazy-loaded and cached
- Pre-download models to avoid first-run latency
pip install mlx-audio>=0.2.9
Models are downloaded from HuggingFace on first use. Use huggingface-cli download to pre-download:
huggingface-cli download mlx-community/whisper-large-v3-mlx
huggingface-cli download mlx-community/Kokoro-82M-bf16Use smaller models or 4-bit quantized versions:
whisper-small-mlxinstead ofwhisper-large-v3-mlxKokoro-82M-4bitinstead ofKokoro-82M-bf16sam-audio-smallinstead ofsam-audio-large
If you get ValueError: too many values to unpack when using non-English languages (Spanish, Chinese, Japanese, etc.) with Kokoro, apply this fix:
# Fix for mlx_audio/tts/models/kokoro/pipeline.py line 443
# Change:
# ps, _ = self.g2p(chunk)
# To:
g2p_result = self.g2p(chunk)
ps = g2p_result[0] if isinstance(g2p_result, tuple) else g2p_resultOne-liner fix:
python -c "
import os
path = os.path.join(os.path.dirname(__import__('mlx_audio').__file__), 'tts/models/kokoro/pipeline.py')
with open(path, 'r') as f: content = f.read()
old = ' ps, _ = self.g2p(chunk)'
new = ''' # Fix: handle both tuple (en) and string (zh/ja/es) returns from g2p
g2p_result = self.g2p(chunk)
ps = g2p_result[0] if isinstance(g2p_result, tuple) else g2p_result'''
if old in content:
with open(path, 'w') as f: f.write(content.replace(old, new))
print('Fix applied!')
"This bug occurs because English g2p returns a tuple (phonemes, tokens) while other languages return just a string.