Fluid Audio is a Swift SDK for fully local, low-latency audio AI on Apple devices. Inference is offloaded to the Apple Neural Engine (ANE), which lowers memory usage and generally speeds up inference.
The SDK includes state-of-the-art speaker diarization, transcription, and voice activity detection via open-source models (MIT/Apache 2.0) that can be integrated with just a few lines of code. The models are optimized for background processing, ambient computing, and always-on workloads: inference runs on the ANE, minimizing CPU usage and avoiding GPU/MPS entirely.
For custom use cases, feedback, additional model support, or platform requests, join our Discord. We’re also bringing visual, language, and TTS models to device and will share updates there.
The SDK provides the following capabilities:
- Automatic Speech Recognition (ASR): Parakeet TDT v3 (0.6b) for transcription; supports all 25 European languages
- Speaker Diarization: Speaker separation and clustering via Pyannote models
- Speaker Embedding Extraction: Generate speaker embeddings for voice comparison, clustering, and speaker identification (see the comparison sketch after this list)
- Voice Activity Detection (VAD): Detect speech segments with Silero models
- Real-time Processing: Designed for near real-time workloads but also works for offline processing
- Apple Neural Engine: Models run efficiently on Apple's ANE for maximum performance with minimal power consumption
- Open-Source Models: All models are publicly available on HuggingFace — converted and optimized by our team; permissive licenses
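For example, two speaker embeddings can be compared with cosine similarity. Below is a minimal sketch, assuming you already have two embedding vectors extracted by the SDK; the helper function, placeholder embeddings, and 0.7 threshold are illustrative, not SDK APIs:

// Cosine similarity between two speaker embeddings (illustrative helper, not an SDK API)
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "Embeddings must have the same dimension")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (normA.squareRoot() * normB.squareRoot() + 1e-9)
}

// Placeholder embeddings; real ones come from the SDK's embedding extraction
let embeddingA: [Float] = [0.12, 0.33, 0.54]
let embeddingB: [Float] = [0.10, 0.31, 0.57]

// Treat as the same speaker above a tuned threshold (0.7 here is a placeholder)
let isSameSpeaker = cosineSimilarity(embeddingA, embeddingB) > 0.7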
Add FluidAudio to your project using Swift Package Manager:
dependencies: [
    .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.4.1"),
],
Important: When adding FluidAudio as a package dependency, add only the library to your target, not the executable. Select the FluidAudio library in the package products dialog and add it to your app target.
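If you manage the manifest by hand, a target declaration in Package.swift would reference the library product like this (the target name "MyApp" is hypothetical):

.target(
    name: "MyApp", // your app or library target
    dependencies: [
        .product(name: "FluidAudio", package: "FluidAudio")
    ]
),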
See DeepWiki for auto-generated docs for this repo:
- Guides
- Modules
- ASR: Getting Started
- ASR: Last Chunk Handling
- Diarization: Speaker Diarization Guide
- VAD: Getting Started
- API
- CLI
The repo is indexed by the DeepWiki MCP server, so your coding tool can access the docs:
{
  "mcpServers": {
    "deepwiki": {
      "url": "https://mcp.deepwiki.com/mcp"
    }
  }
}
For Claude Code:
claude mcp add -s user -t http deepwiki https://mcp.deepwiki.com/mcp
- Model: FluidInference/parakeet-tdt-0.6b-v3-coreml
- Languages: All 25 European languages - see the Hugging Face models for the exact list
- Processing Mode: Batch transcription for complete audio files
- Real-time Factor: ~120x on M4 Pro (processes 1 minute of audio in ~0.5 seconds)
- Streaming Support: Coming soon — batch processing is recommended for production use
- Backend: Same Parakeet TDT v3 model powers our backend ASR
import FluidAudio

// Batch transcription from an audio file
Task {
    // 1) Initialize ASR manager and load models
    let models = try await AsrModels.downloadAndLoad()
    let asrManager = AsrManager(config: .default)
    try await asrManager.initialize(models: models)

    // 2) Prepare 16 kHz mono samples (see: Audio Conversion)
    let samples = try await loadSamples16kMono(path: "path/to/audio.wav")

    // 3) Transcribe the audio
    let result = try await asrManager.transcribe(samples, source: .system)
    print("Transcription: \(result.text)")
    print("Confidence: \(result.confidence)")
}
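The examples above and below call a loadSamples16kMono helper. The Audio Conversion guide covers this in detail; as a rough sketch, one way to implement it with AVFoundation (the function name comes from the examples, the implementation here is illustrative):

import AVFoundation

// Sketch: decode an audio file and convert it to 16 kHz mono Float32 samples
func loadSamples16kMono(path: String) async throws -> [Float] {
    let file = try AVAudioFile(forReading: URL(fileURLWithPath: path))
    guard
        let targetFormat = AVAudioFormat(
            commonFormat: .pcmFormatFloat32, sampleRate: 16_000, channels: 1, interleaved: false),
        let converter = AVAudioConverter(from: file.processingFormat, to: targetFormat),
        let inBuffer = AVAudioPCMBuffer(
            pcmFormat: file.processingFormat, frameCapacity: AVAudioFrameCount(file.length))
    else { throw NSError(domain: "AudioConversion", code: -1) }

    try file.read(into: inBuffer)

    let ratio = targetFormat.sampleRate / file.processingFormat.sampleRate
    let capacity = AVAudioFrameCount(Double(file.length) * ratio) + 1
    guard let outBuffer = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: capacity)
    else { throw NSError(domain: "AudioConversion", code: -2) }

    // Feed the whole file to the converter in one shot, then signal end of stream
    var fed = false
    var conversionError: NSError?
    converter.convert(to: outBuffer, error: &conversionError) { _, status in
        if fed { status.pointee = .endOfStream; return nil }
        fed = true
        status.pointee = .haveData
        return inBuffer
    }
    if let conversionError { throw conversionError }

    return Array(UnsafeBufferPointer(
        start: outBuffer.floatChannelData![0], count: Int(outBuffer.frameLength)))
}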
# Transcribe an audio file (batch)
swift run fluidaudio transcribe audio.wav
AMI Benchmark Results (Single Distant Microphone) using a subset of the files:
- DER: 17.7% — Competitive with Powerset BCE 2023 (18.5%)
- JER: 28.0% — Outperforms x-vector clustering (28.7%); competitive with EEND 2019 (25.3%)
- RTF: 0.02x — Real-time processing with 50x speedup
import FluidAudio

// Diarize an audio file
Task {
    let models = try await DiarizerModels.downloadIfNeeded()
    let diarizer = DiarizerManager() // Uses optimal defaults (0.7 threshold = 17.7% DER)
    diarizer.initialize(models: models)

    // Prepare 16 kHz mono samples (see: Audio Conversion)
    let samples = try await loadSamples16kMono(path: "path/to/meeting.wav")

    // Run diarization
    let result = try diarizer.performCompleteDiarization(samples)
    for segment in result.segments {
        print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
    }
}
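The segments can then be aggregated however you like. For example, a small sketch computing per-speaker talk time, continuing from the result and field names shown above:

// Total speaking time per speaker, from the result of performCompleteDiarization
var talkTime: [String: Double] = [:]
for segment in result.segments {
    talkTime["\(segment.speakerId)", default: 0] += Double(segment.endTimeSeconds - segment.startTimeSeconds)
}
for (speaker, seconds) in talkTime.sorted(by: { $0.value > $1.value }) {
    print("Speaker \(speaker): \(String(format: "%.1f", seconds))s total")
}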
For streaming diarization, see Documentation/SpeakerDiarization.md.
# Benchmark diarization on a single AMI file
swift run fluidaudio diarization-benchmark --single-file ES2004a \
    --chunk-seconds 3 --overlap-seconds 2
# Process an individual file and save JSON
swift run fluidaudio process meeting.wav --output results.json --threshold 0.6
The current VAD APIs require careful tuning for your specific use case; if you need help integrating VAD, reach out in our Discord channel. Our goal is to provide a streamlined API similar to Apple's upcoming SpeechDetector in OS 26.
import FluidAudio

// Programmatic VAD over an audio file
Task {
    // 1) Initialize VAD (async load of Silero model)
    let vad = try await VadManager(config: VadConfig(threshold: 0.3))

    // 2) Prepare 16 kHz mono samples (see: Audio Conversion)
    let samples = try await loadSamples16kMono(path: "path/to/audio.wav")

    // 3) Run VAD and print speech segments (512-sample frames)
    let results = try await vad.processAudioFile(samples)
    let sampleRate = 16000.0
    let frame = 512.0
    var startIndex: Int? = nil
    for (i, r) in results.enumerated() {
        if r.isVoiceActive {
            if startIndex == nil { startIndex = i }
        } else if let s = startIndex {
            // Speech ended at the previous frame, so frame i's start is the end time
            let startSec = (Double(s) * frame) / sampleRate
            let endSec = (Double(i) * frame) / sampleRate
            print(String(format: "Speech: %.2f–%.2fs", startSec, endSec))
            startIndex = nil
        }
    }
    // Flush a segment that is still active at the end of the file
    if let s = startIndex {
        let startSec = (Double(s) * frame) / sampleRate
        let endSec = (Double(results.count) * frame) / sampleRate
        print(String(format: "Speech: %.2f–%.2fs", startSec, endSec))
    }
}
# Run VAD benchmark (mini50 dataset by default)
swift run fluidaudio vad-benchmark --num-files 50 --threshold 0.3
Below are some featured local AI apps using Fluid Audio models on macOS and iOS. Make a PR if you want to add your app!
App | Description |
---|---|
Voice Ink | Local AI for instant, private transcription with near-perfect accuracy. Uses Parakeet ASR. |
Spokenly | Mac dictation app for fast, accurate voice-to-text; supports real-time dictation and file transcription. Uses Parakeet ASR and speaker diarization. |
Slipbox | Privacy-first meeting assistant for real-time conversation intelligence. Uses Parakeet ASR (iOS) and speaker diarization across platforms. |
Whisper Mate | Transcribes movies and audio locally; records and transcribes in real time from speakers or system apps. Uses speaker diarization. |
- CLI is available on macOS only. For iOS, use the library programmatically.
- Models auto-download on first use. If your network restricts Hugging Face access, set an HTTPS proxy:
export https_proxy=http://127.0.0.1:7890
- Windows alternative in development: fluid-server
- If you're looking to capture system audio on a Mac, take a look at AudioCap for reference.
Apache 2.0 — see LICENSE for details.
This project builds upon the excellent work of the sherpa-onnx project for speaker diarization algorithms and techniques.
Pyannote: https://github.com/pyannote/pyannote-audio
WeSpeaker: https://github.com/wenet-e2e/wespeaker
Parakeet-mlx: https://github.com/senstella/parakeet-mlx
silero-vad: https://github.com/snakers4/silero-vad