Android app for on-device transcription with optional on-device translation. All ASR inference runs locally after model download.
- Live transcription with confirmed text plus rolling hypothesis.
- Audio source switching: `Voice` (microphone), `System` (playback capture via MediaProjection).
- In-app model download/load/switch across 6 engine backends.
- Runtime stats while recording (CPU, RAM, tok/s, elapsed audio).
- Settings toggles for Voice Activity Detection, timestamps, and translation options.
- Translation: Google ML Kit (`MlKitTranslator`) with 20 language pairs.
- Android 14+ MediaProjection path uses a foreground service (`MediaProjectionService`).
16 models across 6 engine types, defined in `ModelInfo.kt`.
Benchmarked on Samsung Galaxy S10 (Android 12, API 31) on 2026-02-15.
Test audio: 30-second WAV (16 kHz, mono, PCM 16-bit) of JFK's 1961 inaugural address ("ask not what your country can do for you"), looped from the 11-second whisper.cpp/samples/jfk.wav clip to reach the target duration. The file is pushed to the device via adb before each run.
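The looping step can be sketched in a few lines of Kotlin. This is a minimal, hypothetical illustration of stretching a short 16 kHz mono PCM clip to a target duration (the actual benchmark prepares the WAV offline, not in-app):

```kotlin
// Loop a short PCM clip until it reaches targetSeconds of audio,
// wrapping around the source samples - e.g. 11 s -> 30 s at 16 kHz.
fun loopPcm(samples: ShortArray, sampleRate: Int, targetSeconds: Int): ShortArray {
    val targetLen = sampleRate * targetSeconds
    val out = ShortArray(targetLen)
    for (i in 0 until targetLen) {
        out[i] = samples[i % samples.size] // wrap around the source clip
    }
    return out
}

fun main() {
    val clip = ShortArray(16_000 * 11)      // stand-in for the 11 s jfk.wav samples
    val looped = loopPcm(clip, 16_000, 30)  // stretch to 30 s
    println(looped.size)                    // 480000 samples = 30 s at 16 kHz
}
```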
Metrics:
- Inference — wall-clock time from the `engine.transcribe()` call to result, measured with `System.nanoTime()`. Excludes model download and load time.
- tok/s — output words per second of inference time (`total_words / elapsed_seconds`). Higher is faster.
- RTF — Real Time Factor (`inference_time / audio_duration`). Values below 1.0 mean faster than real time.
- Result — PASS if the transcript contains expected keywords from the JFK speech; FAIL otherwise.
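As a sanity check on the arithmetic, the two derived metrics can be reproduced with a small Kotlin sketch. The helper names are hypothetical (not the app's measurement code), and the word count of 58 is inferred from the Moonshine Tiny row's tok/s rather than taken from raw logs:

```kotlin
// tok/s: output words per second of inference time.
fun tokensPerSecond(totalWords: Int, inferenceMs: Long): Double =
    totalWords * 1000.0 / inferenceMs

// RTF: inference time divided by audio duration; < 1.0 is faster than real time.
fun realTimeFactor(inferenceMs: Long, audioDurationMs: Long): Double =
    inferenceMs.toDouble() / audioDurationMs

fun main() {
    // Moonshine Tiny row: 1,363 ms inference on 30 s of audio, ~58 output words.
    val rtf = realTimeFactor(1_363, 30_000)
    val toks = tokensPerSecond(58, 1_363)
    println(Math.round(rtf * 100) / 100.0)  // 0.05, matching the table
    println(Math.round(toks * 100) / 100.0) // 42.55, matching the table
}
```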
Model links point to the runtime distribution used by the app (mostly Hugging Face repos: csukuangfj/* sherpa-onnx, ggerganov/whisper.cpp GGML, Qwen/Qwen3-ASR-0.6B + jima/* Qwen ONNX).
| Model | Engine | Params | Model size (download) | Languages | Inference | tok/s | RTF | Status |
|---|---|---|---|---|---|---|---|---|
| Moonshine Tiny | sherpa-onnx | 27M | ~125 MB | English | 1,363 ms | 42.55 | 0.05 | ✅ PASS |
| SenseVoice Small | sherpa-onnx | 234M | ~240 MB | zh/en/ja/ko/yue | 1,725 ms | 33.62 | 0.06 | ✅ PASS |
| Whisper Tiny | sherpa-onnx | 39M | ~100 MB | 99 languages | 2,068 ms | 27.08 | 0.07 | ✅ PASS |
| Moonshine Base | sherpa-onnx | 61M | ~290 MB | English | 2,251 ms | 25.77 | 0.08 | ✅ PASS |
| Parakeet TDT 0.6B v3 | sherpa-onnx | 600M | ~671 MB | 25 European | 2,841 ms | 20.41 | 0.09 | ✅ PASS |
| Android Speech (Offline) | SpeechRecognizer | System | Built-in | 50+ languages | 3,615 ms | 1.38 | 0.12 | ✅ PASS [2] |
| Android Speech (Online) | SpeechRecognizer | System | Built-in | 100+ languages | 3,591 ms | 1.39 | 0.12 | ✅ PASS [2] |
| Zipformer Streaming | sherpa-onnx streaming | 20M | ~73 MB | English | 3,568 ms | 16.26 | 0.12 | ✅ PASS |
| Whisper Base (.en) | sherpa-onnx | 74M | ~160 MB | English | 3,917 ms | 14.81 | 0.13 | ✅ PASS |
| Whisper Base | sherpa-onnx | 74M | ~160 MB | 99 languages | 4,038 ms | 14.36 | 0.13 | ✅ PASS |
| Whisper Small | sherpa-onnx | 244M | ~490 MB | 99 languages | 12,329 ms | 4.70 | 0.41 | ✅ PASS |
| Qwen3 ASR 0.6B (ONNX) | ONNX Runtime INT8 | 600M | ~1.9 GB | 30 languages | 15,881 ms | 3.65 | 0.53 | ✅ PASS |
| Whisper Turbo | sherpa-onnx | 809M | ~1.0 GB | 99 languages | 17,930 ms | 3.23 | 0.60 | ✅ PASS |
| Whisper Tiny (whisper.cpp) | whisper.cpp GGML | 39M | ~31 MB | 99 languages | 105,596 ms | 0.55 | 3.52 | ✅ PASS |
| Qwen3 ASR 0.6B (CPU) | Pure C/NEON | 600M | ~1.8 GB | 30 languages | 338,261 ms | 0.17 | 11.28 | ✅ PASS [3] |
| Omnilingual 300M | sherpa-onnx | 300M | ~365 MB | 1,600+ languages | 44,035 ms | 0.05 | 1.47 | ❌ FAIL [1] |
Params are approximate parameter counts (e.g. `27M` = 27 million parameters, `0.6B` = 600 million parameters).
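For illustration, a tiny Kotlin helper (hypothetical, not part of the app) that expands these abbreviations into raw counts:

```kotlin
// Expand "27M" / "0.6B" style parameter-count abbreviations.
fun paramCount(s: String): Long = when {
    s.endsWith("M") -> Math.round(s.dropLast(1).toDouble() * 1e6)
    s.endsWith("B") -> Math.round(s.dropLast(1).toDouble() * 1e9)
    else -> s.toLong()
}

fun main() {
    println(paramCount("27M"))  // 27000000
    println(paramCount("0.6B")) // 600000000
}
```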
[1] Omnilingual MMS CTC 300M outputs the wrong language for English — a known model limitation on both iOS and Android. The CTC model does not support language conditioning (sherpa-onnx #2812).
[2] Android Speech uses acoustic loopback on API < 33 (the WAV is played through the speaker while SpeechRecognizer listens), so the transcript is partial and environment-dependent. API 33+ supports direct file input via `EXTRA_AUDIO_SOURCE`.
[3] Qwen3 ASR 0.6B (CPU) runs on this device but is extremely slow (RTF ~11). Prefer the ONNX INT8 variant on older phones.
15/16 ✅ PASS, 1 ❌ FAIL (known limitation), 0 OOM
Want to see a new model benchmarked? If there is an offline ASR model you would like added or benchmarked on a specific device, please open an issue with the model name and target hardware. Community contributions of benchmark results on different devices are also welcome.
- Orchestrator: `service/WhisperEngine.kt` + `service/TranscriptionCoordinator.kt` (inference loop, VAD, chunking)
- Engines (6 backends):
  - `SherpaOnnxEngine` — Moonshine, SenseVoice, Parakeet, Whisper, Omnilingual via sherpa-onnx ONNX Runtime
  - `SherpaOnnxStreamingEngine` — Zipformer via sherpa-onnx ONNX Runtime (100 ms chunks)
  - `CactusEngine` — Whisper via whisper.cpp (GGML, via JNI)
  - `QwenASREngine` — Qwen3 ASR via antirez/qwen-asr (Pure C, ARM NEON)
  - `QwenOnnxEngine` — Qwen3 ASR INT8 via ONNX Runtime (uses ORT from sherpa-onnx)
  - `AndroidSpeechEngine` — Android SpeechRecognizer (online/offline)
- Supporting services: `E2ETestOrchestrator`, `StreamingChunkManager`, `SystemMetrics`
- Audio capture:
  - `AudioRecorder` (microphone + playback capture)
  - `MediaProjectionService` (foreground service for capture flow)
- Translation: `MlKitTranslator` (Google ML Kit, 20 language pairs)
- UI: `ui/transcription/TranscriptionScreen.kt`, `ui/setup/ModelSetupScreen.kt`
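To make the routing concrete, here is a hypothetical Kotlin sketch of a model-to-backend dispatch. The model IDs and function are illustrative only; the app's real routing lives in `service/WhisperEngine.kt` and `service/TranscriptionCoordinator.kt`:

```kotlin
// Illustrative dispatch table mapping a model ID prefix to one of the
// six backends listed above. Model IDs here are made up for the example.
enum class Backend {
    SHERPA_ONNX, SHERPA_ONNX_STREAMING, WHISPER_CPP,
    QWEN_C, QWEN_ONNX, ANDROID_SPEECH
}

fun engineFor(modelId: String): Backend = when {
    modelId.startsWith("zipformer")      -> Backend.SHERPA_ONNX_STREAMING
    modelId.startsWith("whisper-cpp")    -> Backend.WHISPER_CPP
    modelId.startsWith("qwen3-asr-onnx") -> Backend.QWEN_ONNX   // check before the CPU prefix
    modelId.startsWith("qwen3-asr")      -> Backend.QWEN_C
    modelId.startsWith("android-speech") -> Backend.ANDROID_SPEECH
    // Moonshine, SenseVoice, Parakeet, Whisper (sherpa), Omnilingual
    else -> Backend.SHERPA_ONNX
}

fun main() {
    println(engineFor("zipformer-streaming-en")) // SHERPA_ONNX_STREAMING
    println(engineFor("qwen3-asr-onnx-int8"))    // QWEN_ONNX
    println(engineFor("moonshine-tiny"))         // SHERPA_ONNX
}
```

Note the ordering of the `when` branches: the more specific `qwen3-asr-onnx` prefix must be tested before the generic `qwen3-asr` prefix.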
- Android Studio / Android SDK
- JDK 17
- Android SDK 35
- CMake 3.22.1
- Android 8.0+ (`minSdk 26`)
```bash
git clone --recurse-submodules <repo-url>
cd android-offline-transcribe/VoicePingAndroidOfflineTranscribe
./setup-deps.sh
./gradlew assembleDebug
```

```bash
# Unit tests (172 tests)
cd VoicePingAndroidOfflineTranscribe
./gradlew testDebugUnitTest

# E2E benchmark (all 16 models)
bash /tmp/android-benchmark.sh

# CI and automation
scripts/ci-android-unit-test.sh
scripts/android-e2e-test.sh
scripts/android-userflow-test.sh
```

All audio recording and transcription run locally on the device. The app makes no analytics, telemetry, or crash-reporting calls. The network is used only for model downloads and one optional cloud mode, listed below:
| Connection | Destination | When | Data Sent |
|---|---|---|---|
| sherpa-onnx model download | huggingface.co/csukuangfj/* | User selects a Moonshine, SenseVoice, Parakeet, Whisper (sherpa), Zipformer, or Omnilingual model | None (HTTPS GET only) |
| whisper.cpp model download | huggingface.co/ggerganov/whisper.cpp | User selects Whisper Tiny (whisper.cpp backend) | None (HTTPS GET only) |
| Qwen3 ASR download (CPU) | huggingface.co/Qwen/Qwen3-ASR-0.6B | User selects Qwen3 ASR CPU | None (HTTPS GET only) |
| Qwen3 ASR download (ONNX) | huggingface.co/jima/qwen3-asr-0.6b-onnx-int8 | User selects Qwen3 ASR ONNX | None (HTTPS GET only) |
| ML Kit translation model | Google servers | User enables translation and selects a language pair | None (model download only, ~30 MB per pair) |
| Android Speech (Online) | Google Cloud Speech | User selects android-speech-online model | Audio sent to Google for recognition |
| Android Speech (Offline) | None | User selects android-speech-offline model | None — fully on-device |
All model downloads are user-initiated (on model selection), cached locally, and never re-downloaded once present. No user audio, transcription text, or usage data leaves the device — except when the user explicitly selects Android Speech (Online), which sends audio to Google Cloud for recognition.
Apache License 2.0. See LICENSE.


