Android app for on-device transcription with optional on-device translation. All ASR inference runs locally after model download.
- Live transcription with confirmed text plus rolling hypothesis.
- Audio source switching: `Voice` (microphone), `System` (playback capture via MediaProjection).
- In-app model download/load/switch across 6 engine backends.
- Runtime stats while recording (CPU, RAM, tok/s, elapsed audio).
- Settings toggles for Voice Activity Detection, timestamps, and translation options.
- Translation: Google ML Kit (`MlKitTranslator`) with 20 language pairs.
- Android 14+ MediaProjection path uses a foreground service (`MediaProjectionService`).
16 models across 6 engine types, defined in `ModelInfo.kt`.
Benchmarked on Samsung Galaxy S10 (Android 12, API 31) on 2026-02-15.
Test audio: 30-second WAV (16 kHz, mono, PCM 16-bit) of JFK's 1961 inaugural address ("ask not what your country can do for you"), looped from the 11-second whisper.cpp/samples/jfk.wav clip to reach the target duration. The file is pushed to the device via adb before each run.
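The looping step can be sketched in a few lines of Kotlin. This is a minimal, hypothetical illustration of stretching a short 16 kHz mono PCM clip to a target duration (the actual benchmark prepares the WAV offline, not in-app):

```kotlin
// Loop a short PCM clip until it reaches targetSeconds of audio,
// wrapping around the source samples - e.g. 11 s -> 30 s at 16 kHz.
fun loopPcm(samples: ShortArray, sampleRate: Int, targetSeconds: Int): ShortArray {
    val targetLen = sampleRate * targetSeconds
    val out = ShortArray(targetLen)
    for (i in 0 until targetLen) {
        out[i] = samples[i % samples.size] // wrap around the source clip
    }
    return out
}

fun main() {
    val clip = ShortArray(16_000 * 11)      // stand-in for the 11 s jfk.wav samples
    val looped = loopPcm(clip, 16_000, 30)  // stretch to 30 s
    println(looped.size)                    // 480000 samples = 30 s at 16 kHz
}
```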
Metrics:
- Inference — wall-clock time from the `engine.transcribe()` call to result, measured with `System.nanoTime()`. Excludes model download and load time.
- tok/s — output words per second of inference time (`total_words / elapsed_seconds`). Higher is faster.
- RTF — Real Time Factor (`inference_time / audio_duration`). Values below 1.0 mean faster than real time.
- Result — PASS if the transcript contains expected keywords from the JFK speech; FAIL otherwise.
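As a sanity check on the arithmetic, the two derived metrics can be reproduced with a small Kotlin sketch. The helper names are hypothetical (not the app's measurement code), and the word count of 58 is inferred from the Moonshine Tiny row's tok/s rather than taken from raw logs:

```kotlin
// tok/s: output words per second of inference time.
fun tokensPerSecond(totalWords: Int, inferenceMs: Long): Double =
    totalWords * 1000.0 / inferenceMs

// RTF: inference time divided by audio duration; < 1.0 is faster than real time.
fun realTimeFactor(inferenceMs: Long, audioDurationMs: Long): Double =
    inferenceMs.toDouble() / audioDurationMs

fun main() {
    // Moonshine Tiny row: 1,363 ms inference on 30 s of audio, ~58 output words.
    val rtf = realTimeFactor(1_363, 30_000)
    val toks = tokensPerSecond(58, 1_363)
    println(Math.round(rtf * 100) / 100.0)  // 0.05, matching the table
    println(Math.round(toks * 100) / 100.0) // 42.55, matching the table
}
```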
Model links point to the runtime distribution used by the app (mostly Hugging Face repos: csukuangfj/* sherpa-onnx, ggerganov/whisper.cpp GGML, Qwen/Qwen3-ASR-0.6B + jima/* Qwen ONNX).
| Model | Engine | Params | Model size (download) | Languages | Inference | tok/s | RTF | Status |
|---|---|---|---|---|---|---|---|---|
| Moonshine Tiny | sherpa-onnx | 27M | ~125 MB | English | 1,363 ms | 42.55 | 0.05 | ✅ PASS |
| SenseVoice Small | sherpa-onnx | 234M | ~240 MB | zh/en/ja/ko/yue | 1,725 ms | 33.62 | 0.06 | ✅ PASS |
| Whisper Tiny | sherpa-onnx | 39M | ~100 MB | 99 languages | 2,068 ms | 27.08 | 0.07 | ✅ PASS |
| Moonshine Base | sherpa-onnx | 61M | ~290 MB | English | 2,251 ms | 25.77 | 0.08 | ✅ PASS |
| Parakeet TDT 0.6B v3 | sherpa-onnx | 600M | ~671 MB | 25 European | 2,841 ms | 20.41 | 0.09 | ✅ PASS |
| Android Speech (Offline) | SpeechRecognizer | System | Built-in | 50+ languages | 3,615 ms | 1.38 | 0.12 | ✅ PASS [2] |
| Android Speech (Online) | SpeechRecognizer | System | Built-in | 100+ languages | 3,591 ms | 1.39 | 0.12 | ✅ PASS [2] |
| Zipformer Streaming | sherpa-onnx streaming | 20M | ~73 MB | English | 3,568 ms | 16.26 | 0.12 | ✅ PASS |
| Whisper Base (.en) | sherpa-onnx | 74M | ~160 MB | English | 3,917 ms | 14.81 | 0.13 | ✅ PASS |
| Whisper Base | sherpa-onnx | 74M | ~160 MB | 99 languages | 4,038 ms | 14.36 | 0.13 | ✅ PASS |
| Whisper Small | sherpa-onnx | 244M | ~490 MB | 99 languages | 12,329 ms | 4.70 | 0.41 | ✅ PASS |
| Qwen3 ASR 0.6B (ONNX) | ONNX Runtime INT8 | 600M | ~1.9 GB | 30 languages | 15,881 ms | 3.65 | 0.53 | ✅ PASS |
| Whisper Turbo | sherpa-onnx | 809M | ~1.0 GB | 99 languages | 17,930 ms | 3.23 | 0.60 | ✅ PASS |
| Whisper Tiny (whisper.cpp) | whisper.cpp GGML | 39M | ~31 MB | 99 languages | 105,596 ms | 0.55 | 3.52 | ✅ PASS |
| Qwen3 ASR 0.6B (CPU) | Pure C/NEON | 600M | ~1.8 GB | 30 languages | 338,261 ms | 0.17 | 11.28 | ✅ PASS [3] |
| Omnilingual 300M | sherpa-onnx | 300M | ~365 MB | 1,600+ languages | 44,035 ms | 0.05 | 1.47 | ❌ FAIL [1] |
Params are approximate parameter counts (e.g. `27M` = 27 million parameters, `0.6B` = 600 million parameters).
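For illustration, a tiny Kotlin helper (hypothetical, not part of the app) that expands these abbreviations into raw counts:

```kotlin
// Expand "27M" / "0.6B" style parameter-count abbreviations.
fun paramCount(s: String): Long = when {
    s.endsWith("M") -> Math.round(s.dropLast(1).toDouble() * 1e6)
    s.endsWith("B") -> Math.round(s.dropLast(1).toDouble() * 1e9)
    else -> s.toLong()
}

fun main() {
    println(paramCount("27M"))  // 27000000
    println(paramCount("0.6B")) // 600000000
}
```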
[1] Omnilingual MMS CTC 300M outputs the wrong language for English — a known model limitation on both iOS and Android. The CTC model does not support language conditioning (sherpa-onnx #2812).
[2] Android Speech uses acoustic loopback on API < 33 (the WAV is played through the speaker while SpeechRecognizer listens), so the transcript is partial and environment-dependent. API 33+ supports direct file input via `EXTRA_AUDIO_SOURCE`.
[3] Qwen3 ASR 0.6B (CPU) runs on this device but is extremely slow (RTF ~11). Prefer the ONNX INT8 variant on older phones.
15/16 ✅ PASS, 1 ❌ FAIL (known limitation), 0 OOM
Want to see a new model benchmarked? If there is an offline ASR model you would like added or benchmarked on a specific device, please open an issue with the model name and target hardware. Community contributions of benchmark results on different devices are also welcome.
- Orchestrator: `service/WhisperEngine.kt` + `service/TranscriptionCoordinator.kt` (inference loop, VAD, chunking)
- Engines (6 backends):
  - `SherpaOnnxEngine` — Moonshine, SenseVoice, Parakeet, Whisper, Omnilingual via sherpa-onnx ONNX Runtime
  - `SherpaOnnxStreamingEngine` — Zipformer via sherpa-onnx ONNX Runtime (100 ms chunks)
  - `CactusEngine` — Whisper via whisper.cpp (GGML, via JNI)
  - `QwenASREngine` — Qwen3 ASR via antirez/qwen-asr (Pure C, ARM NEON)
  - `QwenOnnxEngine` — Qwen3 ASR INT8 via ONNX Runtime (uses ORT from sherpa-onnx)
  - `AndroidSpeechEngine` — Android SpeechRecognizer (online/offline)
- Supporting services: `E2ETestOrchestrator`, `StreamingChunkManager`, `SystemMetrics`
- Audio capture:
  - `AudioRecorder` (microphone + playback capture)
  - `MediaProjectionService` (foreground service for capture flow)
- Translation: `MlKitTranslator` (Google ML Kit, 20 language pairs)
- UI: `ui/transcription/TranscriptionScreen.kt`, `ui/setup/ModelSetupScreen.kt`
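To make the routing concrete, here is a hypothetical Kotlin sketch of a model-to-backend dispatch. The model IDs and function are illustrative only; the app's real routing lives in `service/WhisperEngine.kt` and `service/TranscriptionCoordinator.kt`:

```kotlin
// Illustrative dispatch table mapping a model ID prefix to one of the
// six backends listed above. Model IDs here are made up for the example.
enum class Backend {
    SHERPA_ONNX, SHERPA_ONNX_STREAMING, WHISPER_CPP,
    QWEN_C, QWEN_ONNX, ANDROID_SPEECH
}

fun engineFor(modelId: String): Backend = when {
    modelId.startsWith("zipformer")      -> Backend.SHERPA_ONNX_STREAMING
    modelId.startsWith("whisper-cpp")    -> Backend.WHISPER_CPP
    modelId.startsWith("qwen3-asr-onnx") -> Backend.QWEN_ONNX   // check before the CPU prefix
    modelId.startsWith("qwen3-asr")      -> Backend.QWEN_C
    modelId.startsWith("android-speech") -> Backend.ANDROID_SPEECH
    // Moonshine, SenseVoice, Parakeet, Whisper (sherpa), Omnilingual
    else -> Backend.SHERPA_ONNX
}

fun main() {
    println(engineFor("zipformer-streaming-en")) // SHERPA_ONNX_STREAMING
    println(engineFor("qwen3-asr-onnx-int8"))    // QWEN_ONNX
    println(engineFor("moonshine-tiny"))         // SHERPA_ONNX
}
```

Note the ordering of the `when` branches: the more specific `qwen3-asr-onnx` prefix must be tested before the generic `qwen3-asr` prefix.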
- Android Studio / Android SDK
- JDK 17
- Android SDK 35
- CMake 3.22.1
- Android 8.0+ (`minSdk 26`)
```bash
git clone --recurse-submodules <repo-url>
cd android-offline-transcribe/VoicePingAndroidOfflineTranscribe
./setup-deps.sh
./gradlew assembleDebug
```

```bash
# Unit tests (172 tests)
cd VoicePingAndroidOfflineTranscribe
./gradlew testDebugUnitTest

# E2E benchmark (all 16 models)
bash /tmp/android-benchmark.sh

# CI and automation
scripts/ci-android-unit-test.sh
scripts/android-e2e-test.sh
scripts/android-userflow-test.sh
```

All audio recording and transcription run locally on the device. The app makes no analytics, telemetry, or crash-reporting calls. The network is used only for model downloads and one optional cloud mode, listed below:
| Connection | Destination | When | Data Sent |
|---|---|---|---|
| sherpa-onnx model download | huggingface.co/csukuangfj/* | User selects a Moonshine, SenseVoice, Parakeet, Whisper (sherpa), Zipformer, or Omnilingual model | None (HTTPS GET only) |
| whisper.cpp model download | huggingface.co/ggerganov/whisper.cpp | User selects Whisper Tiny (whisper.cpp backend) | None (HTTPS GET only) |
| Qwen3 ASR download (CPU) | huggingface.co/Qwen/Qwen3-ASR-0.6B | User selects Qwen3 ASR CPU | None (HTTPS GET only) |
| Qwen3 ASR download (ONNX) | huggingface.co/jima/qwen3-asr-0.6b-onnx-int8 | User selects Qwen3 ASR ONNX | None (HTTPS GET only) |
| ML Kit translation model | Google servers | User enables translation and selects a language pair | None (model download only, ~30 MB per pair) |
| Android Speech (Online) | Google Cloud Speech | User selects android-speech-online model | Audio sent to Google for recognition |
| Android Speech (Offline) | None | User selects android-speech-offline model | None — fully on-device |
All model downloads are user-initiated (on model selection), cached locally, and never re-downloaded once present. No user audio, transcription text, or usage data leaves the device — except when the user explicitly selects Android Speech (Online), which sends audio to Google Cloud for recognition.
Apache License 2.0. See LICENSE.


