Native iOS implementation of Kyutai Pocket TTS using Rust/Candle.
This crate provides on-device text-to-speech for iOS using the Kyutai Pocket TTS model. It uses the Candle ML framework for inference and UniFFI for Swift bindings.
Pre-built XCFrameworks are available on the Releases page.
Each release includes:
PocketTTS.xcframework- iOS static library (device + simulator)- Swift bindings and wrapper files
- Integration documentation
See docs/INTEGRATION.md for detailed integration instructions.
A full-featured iOS demo app is included for testing and validation.
The demo app includes:
- Text-to-Speech Synthesis with all 8 built-in voices
- Real-time Resource Monitoring (memory, CPU, thermal state)
- Performance Metrics (synthesis time, audio duration, RTF)
- Waveform Visualization of generated audio
- Audio Export for validation testing
See tests/ios-harness/README.md for setup instructions.
┌─────────────────────────────────────────────────┐
│ Swift/SwiftUI App │
├─────────────────────────────────────────────────┤
│ Generated Swift Bindings (UniFFI) │
├─────────────────────────────────────────────────┤
│ PocketTTSEngine │
├─────────────────────────────────────────────────┤
│ FlowLM │ MLPSampler │ MimiDecoder │
│ (70M) │ (10M) │ (20M) │
└─────────────────────────────────────────────────┘
-
Rust toolchain with iOS targets:
rustup target add aarch64-apple-ios rustup target add aarch64-apple-ios-sim
-
Xcode with iOS SDK
./scripts/build-ios.shThis creates:
target/xcframework/PocketTTS.xcframework- Static librarytarget/xcframework/pocket_tts_ios.swift- Swift bindings
- Drag
PocketTTS.xcframeworkinto your Xcode project - Add
pocket_tts_ios.swiftto your Swift sources - Import and use:
import Foundation
// Initialize engine with model path
let modelPath = Bundle.main.path(forResource: "kyutai-pocket-ios", ofType: nil)!
let engine = try PocketTTSEngine(modelPath: modelPath)
// Configure
let config = TTSConfig(
voiceIndex: 0, // Alba
temperature: 0.7,
topP: 0.9,
speed: 1.0,
consistencySteps: 2,
useFixedSeed: false,
seed: 42
)
try engine.configure(config: config)
// Synthesize
let result = try engine.synthesize(text: "Hello, world!")
// result.audioData contains WAV bytesThe model files should be placed in:
kyutai-pocket-ios/
├── model.safetensors # Main model weights (225MB)
├── tokenizer.model # SentencePiece tokenizer (60KB)
└── voices/ # Voice embeddings (4.2MB)
├── alba.safetensors
├── marius.safetensors
├── javert.safetensors
├── jean.safetensors
├── fantine.safetensors
├── cosette.safetensors
├── eponine.safetensors
└── azelma.safetensors
- 8 Built-in Voices: Alba, Marius, Javert, Jean, Fantine, Cosette, Eponine, Azelma
- Streaming Synthesis: Low-latency audio generation with overlap-add
- Configurable: Temperature, top-p, speed, consistency steps
- CPU Optimized: Designed for efficient CPU inference
- Time to first audio (TTFA): ~200ms
- Real-time factor (RTF): ~3-4x on iPhone 15 Pro
- Memory usage: ~150MB during inference
Run latency tests to validate performance:
./scripts/run-latency-bench.sh --streaming # Measure TTFA
./scripts/run-latency-bench.sh --all # Test both modesSee docs/LATENCY_TESTING.md for detailed benchmarking instructions.
Why: When optimizing a complex ML pipeline like TTS, it's easy to introduce regressions—small changes that degrade speech quality in subtle ways. Without objective measurements, you might only notice quality degradation after it's too late, or worse, ship degraded audio to users.
The Challenge: Getting the last few percentage points of quality requires rigorous validation:
- Is the Rust decoder producing identical output to Python?
- Do optimizations improve or degrade intelligibility?
- Are we introducing noise, distortion, or artifacts?
Our Solution: Comprehensive audio quality metrics with automated regression detection.
We measure five key aspects of TTS output quality:
| Metric | What It Measures | Target |
|---|---|---|
| WER (Word Error Rate) | Intelligibility via Whisper ASR | <5% excellent |
| MCD (Mel-Cepstral Distortion) | Spectral similarity to reference | <6 dB good |
| SNR (Signal-to-Noise Ratio) | Signal health and cleanliness | >25 dB excellent |
| THD (Total Harmonic Distortion) | Audio distortion level | <40% acceptable |
| Spectral (Centroid, Rolloff, Flatness) | Frequency characteristics | Tracked |
Every code change is validated automatically:
- Generate Test Audio - Run full TTS pipeline on standard phrases
- Compute Quality Metrics - Measure all 5 dimensions
- Compare to Baseline - Detect regressions automatically
- Block on Failure - PRs with quality regressions cannot merge
# Run quality check locally
cd validation
python quality_metrics.py \
--audio output.wav \
--text "Hello, this is a test." \
--whisper-model base \
--output-json quality_results.json
# Compare to baseline
python baseline_tracker.py \
--check-regression \
--baseline baselines/baseline_v0.4.1.json \
--metrics quality_results.jsonQuality metrics run automatically in GitHub Actions:
- On Pull Requests: Check for regressions (blocking)
- On Main Branch: Update baseline after successful merge
- Quality Reports: Uploaded as artifacts for every run
See validation/README.md for detailed usage.
Before trusting quality metrics, we validate them against known cases:
- ✅ Run 0 (Meta-validation): Test metrics on synthetic audio with known properties
- ✅ Run 1 (Sanity check): Verify metrics produce reasonable values on real TTS
- 🔄 Run 2 (Cross-validation): Compare Rust vs Python outputs
- 🔄 Run 3 (Stability check): Verify metrics are stable across runs
Only after all validation runs pass do we establish the quality baseline.
Docs:
- validation/docs/QUALITY_METRICS.md - Metric definitions and formulas
- validation/docs/ITERATIVE_VALIDATION.md - Validation process
- validation/docs/REGRESSION_DETECTION.md - Usage guide
This system enables us to:
- Catch regressions early - Before they reach production
- Optimize confidently - Know if changes help or hurt quality
- Track progress - Quantify improvements over time
- Ship with confidence - Every release is validated against baseline
The last few percentage points of quality matter—they're the difference between "good enough" and "production ready."
This project uses comprehensive development infrastructure:
- Pre-commit hooks: rustfmt, clippy, gitleaks, tests
- CI/CD pipelines: Lint, test, coverage, iOS build, security scan
- Code coverage: cargo-tarpaulin with 70% minimum threshold
- AI review: CodeRabbit integration
See docs/quality/QUALITY_PLAN.md for details.
This implementation builds upon excellent work from:
- Kyutai Labs - Original Pocket TTS model architecture and trained weights
- babybirdprd/pocket-tts - Complete Rust/Candle port that made iOS integration possible
- HuggingFace Candle - ML framework for efficient inference
See ATTRIBUTION.md for detailed attribution information.
MIT (code), CC-BY-4.0 (model weights)