Fluid Audio is a Swift SDK for fully local, low-latency audio AI on Apple devices. Inference is offloaded to the Apple Neural Engine (ANE), which lowers memory usage and generally speeds up inference.
The SDK includes state-of-the-art speaker diarization, transcription, and voice activity detection via open-source models (MIT/Apache 2.0) that can be integrated with just a few lines of code. The models are optimized for background processing, ambient computing, and always-on workloads: inference runs on the ANE, minimizing CPU usage and avoiding GPU/MPS entirely.
For custom use cases, feedback, additional model support, or platform requests, join our Discord. We’re also bringing visual, language, and TTS models to device and will share updates there.
The SDK provides the following capabilities:
- Automatic Speech Recognition (ASR): Parakeet TDT v3 (0.6b) for transcription; supports all 25 European languages
- Speaker Diarization: Speaker separation and clustering via Pyannote models
- Speaker Embedding Extraction: Generate speaker embeddings for voice comparison, clustering, and speaker identification (see the comparison sketch after this list)
- Voice Activity Detection (VAD): Detect speech segments with Silero models
- Real-time Processing: Designed for near real-time workloads but also works for offline processing
- Apple Neural Engine: Models run efficiently on Apple's ANE for maximum performance with minimal power consumption
- Open-Source Models: All models are publicly available on HuggingFace — converted and optimized by our team; permissive licenses
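For example, two speaker embeddings can be compared with cosine similarity. Below is a minimal sketch, assuming you already have two embedding vectors extracted by the SDK; the helper function, placeholder embeddings, and 0.7 threshold are illustrative, not SDK APIs:

// Cosine similarity between two speaker embeddings (illustrative helper, not an SDK API)
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "Embeddings must have the same dimension")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (normA.squareRoot() * normB.squareRoot() + 1e-9)
}

// Placeholder embeddings; real ones come from the SDK's embedding extraction
let embeddingA: [Float] = [0.12, 0.33, 0.54]
let embeddingB: [Float] = [0.10, 0.31, 0.57]

// Treat as the same speaker above a tuned threshold (0.7 here is a placeholder)
let isSameSpeaker = cosineSimilarity(embeddingA, embeddingB) > 0.7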
Add FluidAudio to your project using Swift Package Manager:
dependencies: [
    .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.4.1"),
],
Important: When adding FluidAudio as a package dependency, add only the library to your target, not the executable. Select the FluidAudio library in the package products dialog and add it to your app target.
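If you manage the manifest by hand, a target declaration in Package.swift would reference the library product like this (the target name "MyApp" is hypothetical):

.target(
    name: "MyApp", // your app or library target
    dependencies: [
        .product(name: "FluidAudio", package: "FluidAudio")
    ]
),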
See DeepWiki for auto-generated docs for this repo:
- Guides
- Modules
- ASR: Getting Started
- ASR: Last Chunk Handling
- Diarization: Speaker Diarization Guide
- VAD: Getting Started
- API
- CLI
The repo is indexed by the DeepWiki MCP server, so your coding tool can access the docs:
{
  "mcpServers": {
    "deepwiki": {
      "url": "https://mcp.deepwiki.com/mcp"
    }
  }
}
For Claude Code:
claude mcp add -s user -t http deepwiki https://mcp.deepwiki.com/mcp
- Model: FluidInference/parakeet-tdt-0.6b-v3-coreml
- Languages: All 25 European languages - see the Hugging Face models for the exact list
- Processing Mode: Batch transcription for complete audio files
- Real-time Factor: ~120x on M4 Pro (processes 1 minute of audio in ~0.5 seconds)
- Streaming Support: Coming soon — batch processing is recommended for production use
- Backend: Same Parakeet TDT v3 model powers our backend ASR
import FluidAudio

// Batch transcription from an audio file
Task {
    // 1) Initialize ASR manager and load models
    let models = try await AsrModels.downloadAndLoad()
    let asrManager = AsrManager(config: .default)
    try await asrManager.initialize(models: models)

    // 2) Prepare 16 kHz mono samples (see: Audio Conversion)
    let samples = try await loadSamples16kMono(path: "path/to/audio.wav")

    // 3) Transcribe the audio
    let result = try await asrManager.transcribe(samples, source: .system)
    print("Transcription: \(result.text)")
    print("Confidence: \(result.confidence)")
}
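The examples above and below call a loadSamples16kMono helper. The Audio Conversion guide covers this in detail; as a rough sketch, one way to implement it with AVFoundation (the function name comes from the examples, the implementation here is illustrative):

import AVFoundation

// Sketch: decode an audio file and convert it to 16 kHz mono Float32 samples
func loadSamples16kMono(path: String) async throws -> [Float] {
    let file = try AVAudioFile(forReading: URL(fileURLWithPath: path))
    guard
        let targetFormat = AVAudioFormat(
            commonFormat: .pcmFormatFloat32, sampleRate: 16_000, channels: 1, interleaved: false),
        let converter = AVAudioConverter(from: file.processingFormat, to: targetFormat),
        let inBuffer = AVAudioPCMBuffer(
            pcmFormat: file.processingFormat, frameCapacity: AVAudioFrameCount(file.length))
    else { throw NSError(domain: "AudioConversion", code: -1) }

    try file.read(into: inBuffer)

    let ratio = targetFormat.sampleRate / file.processingFormat.sampleRate
    let capacity = AVAudioFrameCount(Double(file.length) * ratio) + 1
    guard let outBuffer = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: capacity)
    else { throw NSError(domain: "AudioConversion", code: -2) }

    // Feed the whole file to the converter in one shot, then signal end of stream
    var fed = false
    var conversionError: NSError?
    converter.convert(to: outBuffer, error: &conversionError) { _, status in
        if fed { status.pointee = .endOfStream; return nil }
        fed = true
        status.pointee = .haveData
        return inBuffer
    }
    if let conversionError { throw conversionError }

    return Array(UnsafeBufferPointer(
        start: outBuffer.floatChannelData![0], count: Int(outBuffer.frameLength)))
}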
# Transcribe an audio file (batch)
swift run fluidaudio transcribe audio.wav
AMI Benchmark Results (Single Distant Microphone) using a subset of the files:
- DER: 17.7% — Competitive with Powerset BCE 2023 (18.5%)
- JER: 28.0% — Outperforms x-vector clustering (28.7%); competitive with EEND 2019 (25.3%)
- RTF: 0.02x — Real-time processing with 50x speedup
import FluidAudio

// Diarize an audio file
Task {
    let models = try await DiarizerModels.downloadIfNeeded()
    let diarizer = DiarizerManager() // Uses optimal defaults (0.7 threshold = 17.7% DER)
    diarizer.initialize(models: models)

    // Prepare 16 kHz mono samples (see: Audio Conversion)
    let samples = try await loadSamples16kMono(path: "path/to/meeting.wav")

    // Run diarization
    let result = try diarizer.performCompleteDiarization(samples)
    for segment in result.segments {
        print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
    }
}
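The segments can then be aggregated however you like. For example, a small sketch computing per-speaker talk time, continuing from the result and field names shown above:

// Total speaking time per speaker, from the result of performCompleteDiarization
var talkTime: [String: Double] = [:]
for segment in result.segments {
    talkTime["\(segment.speakerId)", default: 0] += Double(segment.endTimeSeconds - segment.startTimeSeconds)
}
for (speaker, seconds) in talkTime.sorted(by: { $0.value > $1.value }) {
    print("Speaker \(speaker): \(String(format: "%.1f", seconds))s total")
}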
For streaming diarization, see Documentation/SpeakerDiarization.md.
# Benchmark diarization on a single AMI file
swift run fluidaudio diarization-benchmark --single-file ES2004a \
    --chunk-seconds 3 --overlap-seconds 2
# Process an individual file and save JSON
swift run fluidaudio process meeting.wav --output results.json --threshold 0.6
The current VAD APIs require careful tuning for your specific use case; if you need help integrating VAD, reach out in our Discord channel. Our goal is to provide a streamlined API similar to Apple's upcoming SpeechDetector in OS 26.
import FluidAudio

// Programmatic VAD over an audio file
Task {
    // 1) Initialize VAD (async load of Silero model)
    let vad = try await VadManager(config: VadConfig(threshold: 0.3))

    // 2) Prepare 16 kHz mono samples (see: Audio Conversion)
    let samples = try await loadSamples16kMono(path: "path/to/audio.wav")

    // 3) Run VAD and print speech segments (512-sample frames)
    let results = try await vad.processAudioFile(samples)
    let sampleRate = 16000.0
    let frame = 512.0
    var startIndex: Int? = nil
    for (i, r) in results.enumerated() {
        if r.isVoiceActive {
            if startIndex == nil { startIndex = i }
        } else if let s = startIndex {
            // Speech ended at the previous frame, so frame i's start is the end time
            let startSec = (Double(s) * frame) / sampleRate
            let endSec = (Double(i) * frame) / sampleRate
            print(String(format: "Speech: %.2f–%.2fs", startSec, endSec))
            startIndex = nil
        }
    }
    // Flush a segment that is still active at the end of the file
    if let s = startIndex {
        let startSec = (Double(s) * frame) / sampleRate
        let endSec = (Double(results.count) * frame) / sampleRate
        print(String(format: "Speech: %.2f–%.2fs", startSec, endSec))
    }
}
# Run VAD benchmark (mini50 dataset by default)
swift run fluidaudio vad-benchmark --num-files 50 --threshold 0.3
Below are some featured local AI apps using Fluid Audio models on macOS and iOS. Make a PR if you want to add your app!
App | Description |
---|---|
Voice Ink | Local AI for instant, private transcription with near-perfect accuracy. Uses Parakeet ASR. |
Spokenly | Mac dictation app for fast, accurate voice-to-text; supports real-time dictation and file transcription. Uses Parakeet ASR and speaker diarization. |
Slipbox | Privacy-first meeting assistant for real-time conversation intelligence. Uses Parakeet ASR (iOS) and speaker diarization across platforms. |
Whisper Mate | Transcribes movies and audio locally; records and transcribes in real time from speakers or system apps. Uses speaker diarization. |
- CLI is available on macOS only. For iOS, use the library programmatically.
- Models auto-download on first use. If your network restricts Hugging Face access, set an HTTPS proxy:
export https_proxy=http://127.0.0.1:7890
- Windows alternative in development: fluid-server
- If you're looking to capture system audio on a Mac, take a look at AudioCap for reference.
Apache 2.0 — see LICENSE for details.
This project builds upon the excellent work of the sherpa-onnx project for speaker diarization algorithms and techniques.
Pyannote: https://github.com/pyannote/pyannote-audio
WeSpeaker: https://github.com/wenet-e2e/wespeaker
Parakeet-mlx: https://github.com/senstella/parakeet-mlx
silero-vad: https://github.com/snakers4/silero-vad