Cross-platform offline speech app with transcription, translation, TTS, and history/export flows. This repository currently ships a focused model set per platform.
| Home | Transcription + Translation | Demo |
|---|---|---|
| ![]() | ![]() | ![]() |
SenseVoice Small with Apple Translation (English → Japanese) and TTS. Full video: MP4.
| Transcription + Translation | Demo |
|---|---|
| ![]() | ![]() |
SenseVoice Small with ML Kit Translation and TTS. Full video: MP4.
- The Windows desktop app has moved to the separate repository `windows-offline-transcribe/`.
- ASR models: `SenseVoice Small` and `Apple Speech`.
- Audio source switching:
  - `Voice` (microphone)
  - `System` (ReplayKit Broadcast Upload Extension)
- Translation: Apple Translation framework bridge (iOS 18+).
- TTS: `AVSpeechSynthesizer` (`NativeTTSService`).
- History/details/export:
  - SwiftData model (`TranscriptionRecord`)
  - audio session files + ZIP export (`SessionFileManager`, `ZIPExporter`)
- ASR models: `SenseVoice Small`, `Android Speech (Offline)`, `Android Speech (Online)`.
- Audio source switching:
  - `Voice` (microphone)
  - `System` (MediaProjection playback capture)
- Translation providers:
  - ML Kit offline
  - Android system translation (API 31+) via `AndroidSystemTranslator`
- TTS: `AndroidTtsService` (`TextToSpeech`).
- History/details/export:
  - Room (`TranscriptionEntity`, `AppDatabase`)
  - playback + waveform + ZIP export (`AudioPlaybackManager`, `SessionExporter`)
| Model ID | Engine | Languages |
|---|---|---|
| `sensevoice-small` | sherpa-onnx offline | zh/en/ja/ko/yue |
| `apple-speech` | SFSpeechRecognizer | 50+ languages |
| Model ID | Engine | Languages |
|---|---|---|
| `sensevoice-small` | sherpa-onnx offline | zh/en/ja/ko/yue |
| `android-speech-offline` | Android SpeechRecognizer (on-device, API 31+) | System languages |
| `android-speech-online` | Android SpeechRecognizer (standard recognizer) | System languages |
Both platforms support transcribing audio from other apps (music, video calls, etc.) in addition to microphone input. The user switches between Voice and System modes with a segmented control / chip selector above the record button.
iOS System mode — the red Start System Broadcast button replaces the mic button when System is selected.
iOS uses a Broadcast Upload Extension to capture system audio digitally from any app.
How it works:
- The user taps the Start System Broadcast button (`RPSystemBroadcastPickerView`).
- iOS presents the system broadcast picker; the user selects the extension.
- `SampleHandler` (in `BroadcastUploadExtension/`) receives `.audioApp` sample buffers.
- Audio is converted to mono Float32 at 16 kHz and written to a shared memory-mapped ring buffer in the App Group container (`group.com.voiceping.translate`).
- The main app's `SystemAudioSource` reads from the ring buffer and feeds samples to the ASR engine, transparently replacing the microphone path.
┌──────────────────────────┐ Darwin notify ┌─────────────────────────┐
│ Broadcast Upload Ext. │ ──────────────────────▶ │ Main App │
│ SampleHandler.swift │ │ SystemAudioSource │
│ │ Shared Ring Buffer │ WhisperService │
│ CMSampleBuffer → F32 │ ◀─────────────────────▶ │ → ASR Engine │
│ 16 kHz mono │ (~1.88 MB mmap file) │ → Translation / TTS │
└──────────────────────────┘ └─────────────────────────┘
| File | Role |
|---|---|
| `BroadcastUploadExtension/SampleHandler.swift` | Receives system audio, resamples to 16 kHz mono, writes to ring buffer |
| `Shared/SharedAudioRingBuffer.swift` | Lock-free SPSC ring buffer over memory-mapped file (480k samples, ~30 s) |
| `OfflineTranscription/Services/SystemAudioSource.swift` | Reads ring buffer, exposes samples to WhisperService |
| `OfflineTranscription/Views/BroadcastPickerView.swift` | RPSystemBroadcastPickerView wrapped as SwiftUI view |
Android uses the AudioPlaybackCapture API (API 29+) via MediaProjection to capture audio playing from other apps.

How it works:

- The user selects the System chip in the audio source card.
- On first use, the app starts `MediaProjectionService` (foreground service) and requests `MediaProjectionManager.createScreenCaptureIntent()` permission.
- `AudioRecorder.createSystemPlaybackAudioRecord()` builds an `AudioRecord` with `AudioPlaybackCaptureConfiguration` capturing `USAGE_MEDIA`, `USAGE_GAME`, and `USAGE_UNKNOWN`.
- PCM 16-bit audio at 16 kHz is read in 100 ms chunks, normalized to `[-1, 1]` floats, and fed to the ASR engine through the same recording pipeline as the microphone path.
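The last step above is easy to pin down numerically: at 16 kHz, a 100 ms chunk is 1,600 samples (3,200 bytes of 16-bit PCM), and normalization divides each sample by 32768. A minimal sketch (Python for brevity; the actual code is Kotlin in `AudioRecorder.kt`):

```python
# Sketch of the PCM handling described above: little-endian 16-bit PCM
# at 16 kHz, read in 100 ms chunks, normalized to [-1, 1] floats.
import struct

SAMPLE_RATE = 16_000
CHUNK_SAMPLES = SAMPLE_RATE // 10  # 100 ms -> 1_600 samples (3_200 bytes)

def pcm16_to_floats(chunk: bytes):
    """Convert little-endian 16-bit PCM bytes to floats in [-1, 1]."""
    samples = struct.unpack(f"<{len(chunk) // 2}h", chunk)
    return [s / 32768.0 for s in samples]
```

Note the asymmetry of the int16 range: -32768 maps exactly to -1.0, while +32767 maps to just under +1.0.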
┌──────────────────────────┐ Foreground Service ┌─────────────────────────┐
│ MediaProjectionService │ (holds projection) │ AudioRecorder │
│ (notification visible) │ ───────────────────────▶ │ AudioPlaybackCapture │
└──────────────────────────┘ │ → PCM 16kHz mono │
│ → WhisperEngine │
│ → Translation / TTS │
└─────────────────────────┘
| File | Role |
|---|---|
| `.../service/MediaProjectionService.kt` | Foreground service holding MediaProjection permission (Android 14+ requirement) |
| `.../service/AudioRecorder.kt` | Creates AudioRecord with playback capture config; reads PCM chunks |
| `.../model/AudioInputMode.kt` | `MICROPHONE` / `SYSTEM_PLAYBACK` enum |
| `.../ui/transcription/TranscriptionScreen.kt` | `AudioInputModeCard` — chip selector + permission launcher flow |
- Orchestrator: `OfflineTranscription/Services/WhisperService.swift` + `TranscriptionCoordinator.swift` (inference loop, VAD, chunking)
- Engines:
  - `SherpaOnnxOfflineEngine` — SenseVoice via sherpa-onnx ONNX Runtime
  - `AppleSpeechEngine` — built-in SFSpeechRecognizer
- Translation: `AppleTranslationService` (iOS 18+)
- TTS: `NativeTTSService` (AVSpeechSynthesizer)
- Supporting services: `EngineFactory`, `ModelDownloader`, `SystemMetrics`
- Audio capture: `AudioRecorder` (microphone), `SystemAudioSource` (ReplayKit ring buffer IPC)
- Persistence/export: SwiftData + `SessionFileManager` + `ZIPExporter`
- Orchestrator: `.../service/WhisperEngine.kt` + `TranscriptionCoordinator.kt` (inference loop, VAD, chunking)
- Engines:
  - `SherpaOnnxEngine` — SenseVoice via sherpa-onnx ONNX Runtime
  - `AndroidSpeechEngine` — Android SpeechRecognizer (online/offline)
- Translation:
  - `MlKitTranslator` — Google ML Kit offline (20 language pairs)
  - `AndroidSystemTranslator` — Android system translation (API 31+)
- TTS: `AndroidTtsService` (`TextToSpeech`)
- Supporting services: `E2ETestOrchestrator`, `StreamingChunkManager`, `SystemMetrics`, `ModelDownloader`
- Audio capture: `AudioRecorder` (microphone + MediaProjection playback capture), `MediaProjectionService`
- Persistence/export: Room + `AudioPlaybackManager` + `SessionExporter`
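On both platforms the coordinator has the same shape: it buffers samples from whichever audio source is active (microphone or system capture), cuts fixed windows, and hands them to the selected ASR engine. A language-agnostic sketch of that loop, with all class and method names illustrative rather than the actual Swift/Kotlin APIs, and a fixed window standing in for the real VAD-driven chunking:

```python
# Illustrative pipeline shape shared by WhisperService/TranscriptionCoordinator
# (iOS) and WhisperEngine/TranscriptionCoordinator (Android). The audio source
# is transparent to this layer: mic and system capture both call on_samples().

class TranscriptionCoordinator:
    def __init__(self, engine, window_sec=5.0, sample_rate=16_000):
        self.engine = engine                      # e.g. sensevoice-small via sherpa-onnx
        self.window = int(window_sec * sample_rate)
        self.pending = []                         # samples awaiting a full window

    def on_samples(self, samples):
        """Called by either audio source; returns results for any full windows."""
        self.pending.extend(samples)
        results = []
        while len(self.pending) >= self.window:
            chunk = self.pending[:self.window]
            self.pending = self.pending[self.window:]
            results.append(self.engine.transcribe(chunk))
        return results
```

The real coordinators also run VAD to cut windows at speech boundaries instead of fixed sizes, and forward each transcript to the translation and TTS services.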
iOS:

- Xcode 15+
- iOS 17+
- `xcodegen`

Android:

- JDK 17
- Android SDK 35
- Android 8.0+ (`minSdk 26`)
```sh
git clone --recurse-submodules <repo-url>
cd ios-android-offline-speech-translation
```

iOS:

```sh
scripts/setup-ios-deps.sh
scripts/generate-ios-project.sh
open VoicePingIOSAndroidOfflineSpeechTranslation.xcodeproj
```

Android:

```sh
cd VoicePingIOSAndroidOfflineSpeechTranslationAndroid
./setup-deps.sh
./gradlew assembleDebug
```

Tests:

```sh
# iOS
scripts/ci-ios-unit-test.sh
scripts/ios-e2e-test.sh
scripts/ios-ui-flow-tests.sh

# Android
scripts/ci-android-unit-test.sh
scripts/android-e2e-test.sh
scripts/android-userflow-test.sh
```

- Runtime transcription/translation/TTS are local on device.
- Network access is for model/language pack downloads and dependency setup.
Apache License 2.0. See LICENSE.
Measured from E2E `result.json` files using a longer English fixture.

Fixture: `artifacts/benchmarks/long_en_eval.wav` (30.00 s, 16 kHz mono WAV)
- Per-model E2E runs with the same English fixture on each platform.
- `duration_sec = duration_ms / 1000` from each model's `result.json`.
- `Words` is computed from transcript words: `[A-Za-z0-9']+`.
- `tok/s` uses `tokens_per_second` from `result.json` when present; otherwise `Words / duration_sec`.
- `RTF = duration_sec / audio_duration_sec`.
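The formulas above can be checked mechanically. A short sketch, using the published iOS numbers (58 words, 2458 ms inference, 30 s fixture) as the worked example; note the tables prefer the engine-reported `tokens_per_second` when present, so the word-count fallback computed here can differ slightly:

```python
# Sketch of the benchmark arithmetic described above.
import re

def words(transcript):
    """Count transcript words using the [A-Za-z0-9']+ pattern."""
    return len(re.findall(r"[A-Za-z0-9']+", transcript))

def metrics(duration_ms, word_count, audio_sec, tokens_per_second=None):
    """Return (tok/s, RTF) per the methodology bullets."""
    duration_sec = duration_ms / 1000
    tok_s = tokens_per_second if tokens_per_second is not None \
        else word_count / duration_sec
    rtf = duration_sec / audio_sec
    return round(tok_s, 2), round(rtf, 2)

# Worked example with the iOS run: 2458 ms over a 30 s fixture
# gives RTF = 2.458 / 30 ≈ 0.08, i.e. ~12x faster than real time.
```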
| Model | Engine | Words | Inference (ms) | Tok/s | RTF | Status |
|---|---|---|---|---|---|---|
| `sensevoice-small` | sherpa-onnx offline (ONNX Runtime) | 58 | 2458 | 23.59 | 0.08 | ✅ PASS |
| Model | Engine | Words | Inference (ms) | Tok/s | RTF | Status |
|---|---|---|---|---|---|---|
| `sensevoice-small` | sherpa-onnx offline (ONNX Runtime) | 58 | 1725 | 33.63 | 0.06 | ✅ PASS |
```sh
rm -rf artifacts/e2e/ios/* artifacts/e2e/android/*
TARGET_SECONDS=30 scripts/prepare-long-eval-audio.sh
EVAL_WAV_PATH=artifacts/benchmarks/long_en_eval.wav scripts/ios-e2e-test.sh
INSTRUMENT_TIMEOUT_SEC=300 EVAL_WAV_PATH=artifacts/benchmarks/long_en_eval.wav scripts/android-e2e-test.sh
python3 scripts/generate-inference-report.py --audio artifacts/benchmarks/long_en_eval.wav --update-readme
```

One-command runner: `TARGET_SECONDS=30 scripts/run-inference-benchmarks.sh`
Want to see a new model or device benchmark? If there is an offline ASR model you would like added or benchmarked on a specific device, please open an issue with the model name and target hardware. Community contributions of benchmark results are also welcome.