MARS8: A new family of state-of-the-art TTS models
MARS8 achieves state-of-the-art speech quality and speaker similarity in text-to-speech synthesis, excelling in challenging real-world voice cloning scenarios with minimal reference audio.
Evaluated head-to-head against leading TTS systems — Cartesia Sonic-3, ElevenLabs Multilingual v2/v3, and Minimax Speech-2.6-HD — MARS8 delivers top-tier results across all key metrics while maintaining exceptional voice fidelity from references as short as 2 seconds.
| Metric | MARS8-Pro | MARS8-Flash | Sonic-3 | Speech-2.6-HD | Multilingual v2 | Multilingual v3 |
|---|---|---|---|---|---|---|
| CE ↑ | 5.43 | 5.43 | 5.04 | 4.99 | 5.41 | 5.18 |
| PQ ↑ | 7.45 | 7.45 | 6.95 | 6.95 | 7.45 | 7.19 |
| CER ↓ | 5.77% | 5.67% | 8.54% | 11.30% | 4.39% | 14.62% |
| Wavlm-base-sv (cosine similarity) ↑ | 0.8676 | 0.8666 | 0.8420 | 0.8666 | 0.8109 | 0.8253 |
| CAM++ embedding (cosine similarity) ↑ | 0.7097 | 0.7066 | 0.5134 | 0.5878 | 0.3912 | 0.336 |
Key finding: MARS8 achieves state-of-the-art speaker similarity even with audio references as short as 2 seconds — a critical advantage for real-world applications where long, clean reference audio is rarely available.
To validate these claims, we developed the MAMBA Benchmark, a rigorous stress test designed to reflect the most demanding real-world conditions, rather than idealized studio environments.
The name MAMBA is intentional. Our team at CAMB deeply resonates with the mamba mentality: a relentless commitment to excellence, discipline, and continuous improvement. Kobe Bryant's legacy stands as a powerful testament to what sustained hard work and focus can achieve, even when starting as an underdog. In the same spirit, the MAMBA Benchmark embodies difficulty by design, prioritizing the hardest cases, not the easiest ones.
Today, we are open-sourcing the MAMBA Benchmark so the broader community can independently replicate and validate our results. Our goal is for MAMBA to serve not only as a transparent validation framework for our own models, but also as a durable, industry-grade benchmark against which future TTS systems can be evaluated.
| Statistic | Value |
|---|---|
| Total samples | 1,334 |
| Cross-language pairs | 70% |
| Average reference duration | 2.3s |
| Most common reference length | 2.0s |
| Total source audio | 101 min |
| Speech-only segments | 51 min |
Traditional TTS benchmarks rely on clean, long-form reference audio in controlled conditions. MAMBA challenges this by introducing:
- Cross-language voice cloning — 70% of samples require cloning across different languages, testing pronunciation robustness and identity preservation
- Ultra-short references — Average reference duration of just 2.3 seconds mirrors real-world constraints
- Expressive source audio — References contain natural expressiveness rather than neutral read speech
- Create a Camb.ai account — Sign up at camb.ai and generate an API key from your dashboard.

- Install FFmpeg

  ```shell
  # Ubuntu/Debian
  apt update && apt install -y ffmpeg

  # macOS
  brew update && brew install ffmpeg

  # Windows
  winget install ffmpeg
  ```

- Set up Python environment

  ```shell
  python3 -m venv venv
  source ./venv/bin/activate    # Linux/macOS
  # or: .\venv\Scripts\activate # Windows

  pip install -r requirements.txt
  ```

Step 1: Load audio data

```shell
python load_audio.py
```

This downloads and extracts audio from the sources defined in teasers.json.

Step 2: Clean and segment audio

```shell
python load_segments.py
```

This removes background noise using UVR-MDX-NET and splits audio into segments based on subtitle timing. Cleaned segments are saved to ./segments/.
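The subtitle-driven splitting can be sketched roughly as below. This is a minimal illustration only, not the actual load_segments.py implementation: the function names, SRT-style timestamps, and 16 kHz sample rate are assumptions for the sketch.

```python
import re

def srt_time_to_ms(timestamp: str) -> int:
    """Convert an SRT timestamp like '00:00:01,250' to milliseconds."""
    hours, minutes, rest = timestamp.split(":")
    seconds, millis = rest.split(",")
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)

def parse_srt_spans(srt_text: str):
    """Extract (start_ms, end_ms) pairs from SRT subtitle text."""
    timing = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")
    return [(srt_time_to_ms(a), srt_time_to_ms(b)) for a, b in timing.findall(srt_text)]

def slice_segments(samples, spans, sample_rate=16000):
    """Cut a mono sample buffer into one segment per subtitle span."""
    segments = []
    for start_ms, end_ms in spans:
        start = start_ms * sample_rate // 1000
        end = end_ms * sample_rate // 1000
        segments.append(samples[start:end])
    return segments
```

For example, a subtitle spanning 00:00:01,000 to 00:00:03,000 yields one two-second segment; the real pipeline additionally denoises the audio with UVR-MDX-NET before slicing.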
MARS8 delivers consistently strong results across the evaluation dimensions that matter most for production deployments:
| Capability | MARS8 Advantage |
|---|---|
| Minimal reference requirements | High-fidelity cloning from 2s audio |
| Cross-language robustness | Strong performance on 70% cross-lingual test set |
| Pronunciation accuracy | 5.67% CER on multilingual content |
| Speaker identity preservation | 0.87 Wavlm-base-sv / 0.71 CAM++ embedding similarity scores |
All evaluations follow standardized protocols to ensure reproducibility:
| Metric | Method |
|---|---|
| Speaker similarity | Wavlm-base-sv (cosine similarity) and CAM++ embedding (cosine similarity) speaker verification models |
| Transcription accuracy | Character Error Rate (CER) via Whisper ASR |
| Quality assessment | CE (Content Enjoyment) and PQ (Production Quality) scores via Facebook's audiobox-aesthetics model |
The evaluation data, cleaning pipeline, and metric definitions are fully open-sourced.
| # | System | Link |
|---|---|---|
| 1 | Cartesia Sonic-3 | cartesia.ai |
| 2 | ElevenLabs Multilingual v2/v3 | elevenlabs.io |
| 3 | Minimax Speech-2.6-HD | minimax.io |
If you use this benchmark in your research, please cite:
```bibtex
@misc{mars8_2026,
  title  = {MARS8: State-of-the-art Text-to-Speech with Minimal Reference Audio},
  author = {Camb.ai},
  year   = {2026},
  note   = {Evaluated on the MAMBA Benchmark},
  url    = {https://github.com/Camb-ai/MAMBA-BENCHMARK}
}
```

This project is licensed under the MIT License — see LICENSE for details.


