Cross-platform offline speech app with transcription, translation, TTS, and history/export flows. This repository currently ships a focused model set per platform.
| Home | Transcription + Translation | Demo |
|---|---|---|
| ![]() | ![]() | ![]() |
SenseVoice Small with Apple Translation (English → Japanese) and TTS. Full video: MP4.
| Transcription + Translation | Demo |
|---|---|
| ![]() | ![]() |
SenseVoice Small with ML Kit Translation and TTS. Full video: MP4.
- The Windows desktop app has moved to the separate repository `windows-offline-transcribe/`.
- ASR models: `SenseVoice Small` and `Apple Speech`.
- Audio source switching:
  - `Voice` (microphone)
  - `System` (ReplayKit Broadcast Upload Extension)
- Translation: Apple Translation framework bridge (iOS 18+).
- TTS: `AVSpeechSynthesizer` (`NativeTTSService`).
- History/details/export:
  - SwiftData model (`TranscriptionRecord`)
  - audio session files + ZIP export (`SessionFileManager`, `ZIPExporter`)
- ASR models: `SenseVoice Small`, `Android Speech (Offline)`, `Android Speech (Online)`.
- Audio source switching:
  - `Voice` (microphone)
  - `System` (MediaProjection playback capture)
- Translation providers:
  - ML Kit offline
  - Android system translation (API 31+) via `AndroidSystemTranslator`
- TTS: `AndroidTtsService` (`TextToSpeech`).
- History/details/export:
  - Room (`TranscriptionEntity`, `AppDatabase`)
  - playback + waveform + ZIP export (`AudioPlaybackManager`, `SessionExporter`)
| Model ID | Engine | Languages |
|---|---|---|
| `sensevoice-small` | sherpa-onnx offline | zh/en/ja/ko/yue |
| `apple-speech` | SFSpeechRecognizer | 50+ languages |
| Model ID | Engine | Languages |
|---|---|---|
| `sensevoice-small` | sherpa-onnx offline | zh/en/ja/ko/yue |
| `android-speech-offline` | Android SpeechRecognizer (on-device, API 31+) | System languages |
| `android-speech-online` | Android SpeechRecognizer (standard recognizer) | System languages |
Both platforms support transcribing audio from other apps (music, video calls, etc.) in addition to microphone input. The user switches between Voice and System modes with a segmented control / chip selector above the record button.
iOS System mode — the red Start System Broadcast button replaces the mic button when System is selected.
iOS uses a Broadcast Upload Extension to capture system audio digitally from any app.
How it works:
- The user taps the Start System Broadcast button (`RPSystemBroadcastPickerView`).
- iOS presents the system broadcast picker; the user selects the extension.
- `SampleHandler` (in `BroadcastUploadExtension/`) receives `.audioApp` sample buffers.
- Audio is converted to mono Float32 at 16 kHz and written to a shared memory-mapped ring buffer in the App Group container (`group.com.voiceping.translate`).
- The main app's `SystemAudioSource` reads from the ring buffer and feeds samples to the ASR engine, transparently replacing the microphone path.
┌──────────────────────────┐ Darwin notify ┌─────────────────────────┐
│ Broadcast Upload Ext. │ ──────────────────────▶ │ Main App │
│ SampleHandler.swift │ │ SystemAudioSource │
│ │ Shared Ring Buffer │ WhisperService │
│ CMSampleBuffer → F32 │ ◀─────────────────────▶ │ → ASR Engine │
│ 16 kHz mono │ (~1.88 MB mmap file) │ → Translation / TTS │
└──────────────────────────┘ └─────────────────────────┘
| File | Role |
|---|---|
| `BroadcastUploadExtension/SampleHandler.swift` | Receives system audio, resamples to 16 kHz mono, writes to ring buffer |
| `Shared/SharedAudioRingBuffer.swift` | Lock-free SPSC ring buffer over memory-mapped file (480k samples, ~30 s) |
| `OfflineTranscription/Services/SystemAudioSource.swift` | Reads ring buffer, exposes samples to WhisperService |
| `OfflineTranscription/Views/BroadcastPickerView.swift` | RPSystemBroadcastPickerView wrapped as SwiftUI view |
Android uses the AudioPlaybackCapture API (API 29+) via MediaProjection to capture audio playing from other apps.

How it works:

- The user selects the System chip in the audio source card.
- On first use, the app starts `MediaProjectionService` (foreground service) and requests `MediaProjectionManager.createScreenCaptureIntent()` permission.
- `AudioRecorder.createSystemPlaybackAudioRecord()` builds an `AudioRecord` with `AudioPlaybackCaptureConfiguration` capturing `USAGE_MEDIA`, `USAGE_GAME`, and `USAGE_UNKNOWN`.
- PCM 16-bit audio at 16 kHz is read in 100 ms chunks, normalized to `[-1, 1]` floats, and fed to the ASR engine through the same recording pipeline as the microphone path.
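The last step above is easy to pin down numerically: at 16 kHz, a 100 ms chunk is 1,600 samples (3,200 bytes of 16-bit PCM), and normalization divides each sample by 32768. A minimal sketch (Python for brevity; the actual code is Kotlin in `AudioRecorder.kt`):

```python
# Sketch of the PCM handling described above: little-endian 16-bit PCM
# at 16 kHz, read in 100 ms chunks, normalized to [-1, 1] floats.
import struct

SAMPLE_RATE = 16_000
CHUNK_SAMPLES = SAMPLE_RATE // 10  # 100 ms -> 1_600 samples (3_200 bytes)

def pcm16_to_floats(chunk: bytes):
    """Convert little-endian 16-bit PCM bytes to floats in [-1, 1]."""
    samples = struct.unpack(f"<{len(chunk) // 2}h", chunk)
    return [s / 32768.0 for s in samples]
```

Note the asymmetry of the int16 range: -32768 maps exactly to -1.0, while +32767 maps to just under +1.0.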
┌──────────────────────────┐ Foreground Service ┌─────────────────────────┐
│ MediaProjectionService │ (holds projection) │ AudioRecorder │
│ (notification visible) │ ───────────────────────▶ │ AudioPlaybackCapture │
└──────────────────────────┘ │ → PCM 16kHz mono │
│ → WhisperEngine │
│ → Translation / TTS │
└─────────────────────────┘
| File | Role |
|---|---|
| `.../service/MediaProjectionService.kt` | Foreground service holding MediaProjection permission (Android 14+ requirement) |
| `.../service/AudioRecorder.kt` | Creates AudioRecord with playback capture config; reads PCM chunks |
| `.../model/AudioInputMode.kt` | `MICROPHONE` / `SYSTEM_PLAYBACK` enum |
| `.../ui/transcription/TranscriptionScreen.kt` | `AudioInputModeCard` — chip selector + permission launcher flow |
- Orchestrator: `OfflineTranscription/Services/WhisperService.swift` + `TranscriptionCoordinator.swift` (inference loop, VAD, chunking)
- Engines:
  - `SherpaOnnxOfflineEngine` — SenseVoice via sherpa-onnx ONNX Runtime
  - `AppleSpeechEngine` — built-in SFSpeechRecognizer
- Translation: `AppleTranslationService` (iOS 18+)
- TTS: `NativeTTSService` (AVSpeechSynthesizer)
- Supporting services: `EngineFactory`, `ModelDownloader`, `SystemMetrics`
- Audio capture: `AudioRecorder` (microphone), `SystemAudioSource` (ReplayKit ring buffer IPC)
- Persistence/export: SwiftData + `SessionFileManager` + `ZIPExporter`
- Orchestrator: `.../service/WhisperEngine.kt` + `TranscriptionCoordinator.kt` (inference loop, VAD, chunking)
- Engines:
  - `SherpaOnnxEngine` — SenseVoice via sherpa-onnx ONNX Runtime
  - `AndroidSpeechEngine` — Android SpeechRecognizer (online/offline)
- Translation:
  - `MlKitTranslator` — Google ML Kit offline (20 language pairs)
  - `AndroidSystemTranslator` — Android system translation (API 31+)
- TTS: `AndroidTtsService` (`TextToSpeech`)
- Supporting services: `E2ETestOrchestrator`, `StreamingChunkManager`, `SystemMetrics`, `ModelDownloader`
- Audio capture: `AudioRecorder` (microphone + MediaProjection playback capture), `MediaProjectionService`
- Persistence/export: Room + `AudioPlaybackManager` + `SessionExporter`
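On both platforms the coordinator has the same shape: it buffers samples from whichever audio source is active (microphone or system capture), cuts fixed windows, and hands them to the selected ASR engine. A language-agnostic sketch of that loop, with all class and method names illustrative rather than the actual Swift/Kotlin APIs, and a fixed window standing in for the real VAD-driven chunking:

```python
# Illustrative pipeline shape shared by WhisperService/TranscriptionCoordinator
# (iOS) and WhisperEngine/TranscriptionCoordinator (Android). The audio source
# is transparent to this layer: mic and system capture both call on_samples().

class TranscriptionCoordinator:
    def __init__(self, engine, window_sec=5.0, sample_rate=16_000):
        self.engine = engine                      # e.g. sensevoice-small via sherpa-onnx
        self.window = int(window_sec * sample_rate)
        self.pending = []                         # samples awaiting a full window

    def on_samples(self, samples):
        """Called by either audio source; returns results for any full windows."""
        self.pending.extend(samples)
        results = []
        while len(self.pending) >= self.window:
            chunk = self.pending[:self.window]
            self.pending = self.pending[self.window:]
            results.append(self.engine.transcribe(chunk))
        return results
```

The real coordinators also run VAD to cut windows at speech boundaries instead of fixed sizes, and forward each transcript to the translation and TTS services.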
iOS:

- Xcode 15+
- iOS 17+
- `xcodegen`

Android:

- JDK 17
- Android SDK 35
- Android 8.0+ (`minSdk 26`)
```sh
git clone --recurse-submodules <repo-url>
cd ios-android-offline-speech-translation
```

iOS:

```sh
scripts/setup-ios-deps.sh
scripts/generate-ios-project.sh
open VoicePingIOSAndroidOfflineSpeechTranslation.xcodeproj
```

Android:

```sh
cd VoicePingIOSAndroidOfflineSpeechTranslationAndroid
./setup-deps.sh
./gradlew assembleDebug
```

Tests:

```sh
# iOS
scripts/ci-ios-unit-test.sh
scripts/ios-e2e-test.sh
scripts/ios-ui-flow-tests.sh

# Android
scripts/ci-android-unit-test.sh
scripts/android-e2e-test.sh
scripts/android-userflow-test.sh
```

- Runtime transcription/translation/TTS are local on device.
- Network access is for model/language pack downloads and dependency setup.
Apache License 2.0. See LICENSE.
Measured from E2E `result.json` files using a longer English fixture.

Fixture: `artifacts/benchmarks/long_en_eval.wav` (30.00 s, 16 kHz mono WAV)
- Per-model E2E runs with the same English fixture on each platform.
- `duration_sec = duration_ms / 1000` from each model's `result.json`.
- `Words` is computed from transcript words: `[A-Za-z0-9']+`.
- `tok/s` uses `tokens_per_second` from `result.json` when present; otherwise `Words / duration_sec`.
- `RTF = duration_sec / audio_duration_sec`.
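The formulas above can be checked mechanically. A short sketch, using the published iOS numbers (58 words, 2458 ms inference, 30 s fixture) as the worked example; note the tables prefer the engine-reported `tokens_per_second` when present, so the word-count fallback computed here can differ slightly:

```python
# Sketch of the benchmark arithmetic described above.
import re

def words(transcript):
    """Count transcript words using the [A-Za-z0-9']+ pattern."""
    return len(re.findall(r"[A-Za-z0-9']+", transcript))

def metrics(duration_ms, word_count, audio_sec, tokens_per_second=None):
    """Return (tok/s, RTF) per the methodology bullets."""
    duration_sec = duration_ms / 1000
    tok_s = tokens_per_second if tokens_per_second is not None \
        else word_count / duration_sec
    rtf = duration_sec / audio_sec
    return round(tok_s, 2), round(rtf, 2)

# Worked example with the iOS run: 2458 ms over a 30 s fixture
# gives RTF = 2.458 / 30 ≈ 0.08, i.e. ~12x faster than real time.
```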
| Model | Engine | Words | Inference (ms) | Tok/s | RTF | Status |
|---|---|---|---|---|---|---|
| `sensevoice-small` | sherpa-onnx offline (ONNX Runtime) | 58 | 2458 | 23.59 | 0.08 | ✅ PASS |
| Model | Engine | Words | Inference (ms) | Tok/s | RTF | Status |
|---|---|---|---|---|---|---|
| `sensevoice-small` | sherpa-onnx offline (ONNX Runtime) | 58 | 1725 | 33.63 | 0.06 | ✅ PASS |
```sh
rm -rf artifacts/e2e/ios/* artifacts/e2e/android/*
TARGET_SECONDS=30 scripts/prepare-long-eval-audio.sh
EVAL_WAV_PATH=artifacts/benchmarks/long_en_eval.wav scripts/ios-e2e-test.sh
INSTRUMENT_TIMEOUT_SEC=300 EVAL_WAV_PATH=artifacts/benchmarks/long_en_eval.wav scripts/android-e2e-test.sh
python3 scripts/generate-inference-report.py --audio artifacts/benchmarks/long_en_eval.wav --update-readme
```

One-command runner: `TARGET_SECONDS=30 scripts/run-inference-benchmarks.sh`
Want to see a new model or device benchmark? If there is an offline ASR model you would like added or benchmarked on a specific device, please open an issue with the model name and target hardware. Community contributions of benchmark results are also welcome.