Commit 64952fa

Add audio support with native multilingual TTS
- Integrate mlx-audio for STT/TTS (Whisper, Parakeet, Kokoro)
- Add native voice examples: English, Spanish, French, Chinese
- Include mlx-audio 0.2.9 multilingual bug fix in docs
- Update description: vLLM-like inference for Text, Image, Video & Audio
1 parent 4e01e67 commit 64952fa

14 files changed, +1986 -3 lines changed

README.md

Lines changed: 34 additions & 2 deletions
@@ -1,6 +1,6 @@
 # vLLM-MLX
 
-**Apple Silicon MLX Backend for vLLM** - GPU-accelerated LLM inference on Mac
+**vLLM-like inference for Apple Silicon** - GPU-accelerated Text, Image, Video & Audio on Mac
 
 [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
 [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
@@ -14,11 +14,13 @@ vllm-mlx brings native Apple Silicon GPU acceleration to vLLM by integrating:
 - **[MLX](https://github.com/ml-explore/mlx)**: Apple's ML framework with unified memory and Metal kernels
 - **[mlx-lm](https://github.com/ml-explore/mlx-lm)**: Optimized LLM inference with KV cache and quantization
 - **[mlx-vlm](https://github.com/Blaizzy/mlx-vlm)**: Vision-language models for multimodal inference
+- **[mlx-audio](https://github.com/Blaizzy/mlx-audio)**: Speech-to-Text and Text-to-Speech with native voices
 
 ## Features
 
+- **Multimodal** - Text, Image, Video & Audio in one platform
 - **Native GPU acceleration** on Apple Silicon (M1, M2, M3, M4)
-- **Vision-language models** - image, video, and audio understanding
+- **Native TTS voices** - Spanish, French, Chinese, Japanese + 5 more languages
 - **OpenAI API compatible** - drop-in replacement for OpenAI client
 - **MCP Tool Calling** - integrate external tools via Model Context Protocol
 - **Paged KV Cache** - memory-efficient caching with prefix sharing
@@ -77,6 +79,35 @@ response = client.chat.completions.create(
 )
 ```
 
+### Audio (TTS/STT)
+
+```bash
+# Install audio dependencies
+pip install vllm-mlx[audio]
+python -m spacy download en_core_web_sm
+brew install espeak-ng # macOS, for non-English languages
+```
+
+```bash
+# Text-to-Speech (English)
+python examples/tts_example.py "Hello, how are you?" --play
+
+# Text-to-Speech (Spanish)
+python examples/tts_multilingual.py "Hola mundo" --lang es --play
+
+# List available models and languages
+python examples/tts_multilingual.py --list-models
+python examples/tts_multilingual.py --list-languages
+```
+
+**Supported TTS Models:**
+| Model | Languages | Description |
+|-------|-----------|-------------|
+| Kokoro | EN, ES, FR, JA, ZH, IT, PT, HI | Fast, 82M params, 11 voices |
+| Chatterbox | 15+ languages | Expressive, voice cloning |
+| VibeVoice | EN | Realtime, low latency |
+| VoxCPM | ZH, EN | High quality Chinese/English |
+
 ## Documentation
 
 For full documentation, see the [docs](docs/) directory:
@@ -89,6 +120,7 @@ For full documentation, see the [docs](docs/) directory:
 - [OpenAI-Compatible Server](docs/guides/server.md)
 - [Python API](docs/guides/python-api.md)
 - [Multimodal (Images & Video)](docs/guides/multimodal.md)
+- [Audio (STT/TTS)](docs/guides/audio.md)
 - [MCP & Tool Calling](docs/guides/mcp-tools.md)
 - [Continuous Batching](docs/guides/continuous-batching.md)

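The multilingual TTS command added to the README could also be driven from Python. The following is a minimal sketch, not part of this commit: the script path and the `--lang` and `--play` flags are copied from the README hunk, while the helper `build_tts_command`, its parameters, and the assumption that `--play` may be omitted are hypothetical. It only builds the argument list; running it requires a checkout of the repository with the audio dependencies installed.

```python
import shlex
import sys
from typing import List

# Script path and flags come from the README text in this commit.
# The helper itself is hypothetical and is not part of vllm-mlx.
TTS_SCRIPT = "examples/tts_multilingual.py"


def build_tts_command(text: str, lang: str, play: bool = False) -> List[str]:
    """Return the argv list for one multilingual TTS invocation."""
    cmd = [sys.executable, TTS_SCRIPT, text, "--lang", lang]
    if play:
        # Assumption: --play is an optional switch, as its README usage suggests.
        cmd.append("--play")
    return cmd


if __name__ == "__main__":
    cmd = build_tts_command("Hola mundo", lang="es", play=True)
    # Print a shell-quoted form of the command for inspection.
    print(shlex.join(cmd))
    # To synthesize speech, run from the repository root:
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with text that contains spaces or punctuation.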