Skip to content

Camb-ai/MAMBA-BENCHMARK

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MARS8 Benchmark Results

MARS8: The world's first family of TTS models

Overview  Performance  Benchmark  Getting Started  Methodology  Citation  Discord


MARS8 Benchmark Comparison


Overview

MARS8 achieves state-of-the-art speech quality and speaker similarity in text-to-speech synthesis, excelling in challenging real-world voice cloning scenarios with minimal reference audio.

Evaluated head-to-head against leading TTS systems — Cartesia Sonic-3, ElevenLabs Multilingual v2/v3, and Minimax Speech-2.6-HD — MARS8 delivers top-tier results across all key metrics while maintaining exceptional voice fidelity from references as short as 2 seconds.


Performance

Full Benchmark Results

View detailed metrics

Speech Quality

Metric MARS8-Pro MARS8-Flash Sonic-3 Speech-2.6-HD Multilingual v2 Multilingual v3
CE 5.43 5.43 5.04 4.99 5.41 5.18
PQ 7.45 7.45 6.95 6.95 7.45 7.19

Voice Cloning Accuracy

Metric MARS8-Pro MARS8-Flash Sonic-3 Speech-2.6-HD Multilingual v2 Multilingual v3
CER 5.77% 5.67% 8.54% 11.30% 4.39% 14.62%

Speaker Similarity

Metric MARS8-Pro MARS8-Flash Sonic-3 Speech-2.6-HD Multilingual v2 Multilingual v3
Wavlm-base-sv (cosine similarity) 0.8676 0.8666 0.8420 0.8666 0.8109 0.8253
CAM++ embedding (cosine similarity) 0.7097 0.7066 0.5134 0.5878 0.3912 0.336

Key finding: MARS8 achieves state-of-the-art speaker similarity even with audio references as short as 2 seconds — a critical advantage for real-world applications where long, clean reference audio is rarely available.


Benchmark

MAMBA: The "Kobe Bryant" of TTS Benchmarks

To validate these claims, we developed the MAMBA Benchmark, a rigorous stress test designed to reflect the most demanding real-world conditions, rather than idealized studio environments.

The name MAMBA is intentional. Our team at CAMB deeply resonates with the mamba mentality: a relentless commitment to excellence, discipline, and continuous improvement. Kobe Bryant's legacy stands as a powerful testament to what sustained hard work and focus can achieve, even when starting as an underdog. In the same spirit, the MAMBA Benchmark embodies difficulty by design, prioritizing the hardest cases, not the easiest ones.

Today, we are open-sourcing the MAMBA Benchmark so the broader community can independently replicate and validate our results. Our goal is for MAMBA to serve not only as a transparent validation framework for our own models, but also as a durable, industry-grade benchmark against which future TTS systems can be evaluated.

Statistic Value
Total samples 1,334
Cross-language pairs 70%
Average reference duration 2.3s
Most common reference length 2.0s
Total source audio 101 min
Speech-only segments 51 min

Why MAMBA?

Traditional TTS benchmarks rely on clean, long-form reference audio in controlled conditions. MAMBA challenges this by introducing:

  • Cross-language voice cloning — 70% of samples require cloning across different languages, testing pronunciation robustness and identity preservation
  • Ultra-short references — Average reference duration of just 2.3 seconds mirrors real-world constraints
  • Expressive source audio — References contain natural expressiveness rather than neutral read speech

Getting Started

Prerequisites

  1. Create a Camb.ai account — Sign up at camb.ai and generate an API key from your dashboard.

  2. Install FFmpeg

    # Ubuntu/Debian
    apt update && apt install -y ffmpeg
    
    # macOS
    brew update && brew install ffmpeg
    
    # Windows
    winget install ffmpeg
  3. Set up Python environment

    python3 -m venv venv
    source ./venv/bin/activate  # Linux/macOS
    # or
    .\venv\Scripts\activate     # Windows

Installation

pip install -r requirements.txt

Running the Benchmark

Step 1: Load audio data

python load_audio.py

This downloads and extracts audio from the sources defined in teasers.json.

Step 2: Clean and segment audio

python load_segments.py

This removes background noise using UVR-MDX-NET and splits audio into segments based on subtitle timing. Cleaned segments are saved to ./segments/.


Results

MARS8 demonstrates consistent superiority across the evaluation dimensions that matter most for production deployments:

Capability MARS8 Advantage
Minimal reference requirements High-fidelity cloning from 2s audio
Cross-language robustness Strong performance on 70% cross-lingual test set
Pronunciation accuracy 5.67% CER on multilingual content
Speaker identity preservation 0.87 Wavlm-base-sv / 0.71 CAM++ embedding similarity scores

Methodology

All evaluations follow standardized protocols to ensure reproducibility:

Metric Method
Speaker similarity Wavlm-base-sv (cosine similarity) and CAM++ embedding (cosine similarity) speaker verification models
Transcription accuracy Character Error Rate (CER) via Whisper ASR
Quality assessment CE and PQ scores via Facebook audio-aesthetics model

The evaluation data, cleaning pipeline, and metric definitions are fully open-sourced.


References

# System Link
1 Cartesia Sonic-3 cartesia.ai
2 ElevenLabs Multilingual v2/v3 elevenlabs.io
3 Minimax Speech-2.6-HD minimax.io

Citation

If you use this benchmark in your research, please cite:

@misc{mars8_2026,
  title   = {MARS8: State-of-the-art Text-to-Speech with Minimal Reference Audio},
  author  = {Camb.ai},
  year    = {2026},
  note    = {Evaluated on the MAMBA Benchmark},
  url     = {https://github.com/Camb-ai/MAMBA-BENCHMARK}
}

License

This project is licensed under the MIT License — see LICENSE for details.


camb.ai

About

Data used in evaluation of MARS8 TTS

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages