
Android Offline Transcribe


Android app for on-device transcription with optional on-device translation. All ASR inference runs locally after model download.

Benchmark (Samsung Galaxy S10)

(Chart: Android inference speed. Screenshots: model setup, transcription.)

Demo (File Transcription)

File transcription demo (GIF)

MP4 demo video

Current Scope

  • Live transcription with confirmed text plus rolling hypothesis.
  • Audio source switching: Voice (microphone), System (playback capture via MediaProjection).
  • In-app model download/load/switch across 6 engine backends.
  • Runtime stats while recording (CPU, RAM, tok/s, elapsed audio).
  • Settings toggles for Voice Activity Detection, timestamps, and translation options.
  • Translation: Google ML Kit (MlKitTranslator) with 20 language pairs.
  • Android 14+ MediaProjection path uses a foreground service (MediaProjectionService).
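The System audio source above relies on Android's playback-capture API (API 29+). As a rough sketch of how that path can be wired (the repo's actual `AudioRecorder`/`MediaProjectionService` code may differ, and `buildPlaybackCaptureRecord` is an illustrative name, not an API from this project):

```kotlin
import android.media.AudioAttributes
import android.media.AudioFormat
import android.media.AudioPlaybackCaptureConfiguration
import android.media.AudioRecord
import android.media.projection.MediaProjection

// Hypothetical helper: builds an AudioRecord that captures other apps' playback.
// Requires API 29+ and an active MediaProjection granted via MediaProjectionManager.
fun buildPlaybackCaptureRecord(projection: MediaProjection): AudioRecord {
    val config = AudioPlaybackCaptureConfiguration.Builder(projection)
        .addMatchingUsage(AudioAttributes.USAGE_MEDIA)   // capture media playback only
        .build()
    val format = AudioFormat.Builder()
        .setEncoding(AudioFormat.ENCODING_PCM_16BIT)
        .setSampleRate(16_000)                           // ASR engines expect 16 kHz mono
        .setChannelMask(AudioFormat.CHANNEL_IN_MONO)
        .build()
    return AudioRecord.Builder()
        .setAudioFormat(format)
        .setAudioPlaybackCaptureConfig(config)
        .build()
}
```

On Android 14+ the `MediaProjection` must be started from a foreground service declared with the `mediaProjection` type, which is why the app routes this flow through `MediaProjectionService`.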

Supported Models & Benchmarks

16 models across 6 engine types. Defined in ModelInfo.kt.

Benchmarked on Samsung Galaxy S10 (Android 12, API 31) on 2026-02-15.

Test audio: 30-second WAV (16 kHz, mono, PCM 16-bit) of JFK's 1961 inaugural address ("ask not what your country can do for you"), looped from the 11-second whisper.cpp/samples/jfk.wav clip to reach the target duration. The file is pushed to the device via adb before each run.
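The looping step can be sketched as a small PCM helper (hypothetical; the actual prep may be done with an external tool before pushing the file via adb):

```kotlin
// Tile a short 16 kHz mono PCM clip until it covers the target duration, then truncate.
// Mirrors the benchmark's prep: an 11 s clip looped out to 30 s.
fun loopToDuration(clip: ShortArray, targetSeconds: Int, sampleRate: Int = 16_000): ShortArray {
    require(clip.isNotEmpty()) { "clip must contain samples" }
    val target = targetSeconds * sampleRate
    val out = ShortArray(target)
    for (i in 0 until target) out[i] = clip[i % clip.size]
    return out
}
```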

Metrics:

  • Inference — wall-clock time from engine.transcribe() call to result, measured with System.nanoTime(). Excludes model download and load time.
  • tok/s — output words per second of inference time (total_words / elapsed_seconds). Higher is faster.
  • RTF — Real Time Factor (inference_time / audio_duration). Values below 1.0 mean faster than real-time.
  • Result — PASS if the transcript contains expected keywords from the JFK speech; FAIL otherwise.
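The two derived metrics reduce to simple ratios; a minimal sketch (names are illustrative, not the app's API):

```kotlin
// Derived benchmark metrics as defined above.
data class BenchmarkMetrics(val tokensPerSecond: Double, val rtf: Double)

fun computeMetrics(totalWords: Int, inferenceMs: Long, audioSeconds: Double): BenchmarkMetrics {
    val elapsedSeconds = inferenceMs / 1000.0
    return BenchmarkMetrics(
        tokensPerSecond = totalWords / elapsedSeconds,   // higher is faster
        rtf = elapsedSeconds / audioSeconds              // < 1.0 means faster than real time
    )
}
```

For example, roughly 58 output words in 1,363 ms over 30 s of audio gives tok/s ≈ 42.6 and RTF ≈ 0.05, consistent with the Moonshine Tiny row below.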

Model links point to the runtime distribution used by the app (mostly Hugging Face repos: csukuangfj/* sherpa-onnx, ggerganov/whisper.cpp GGML, Qwen/Qwen3-ASR-0.6B + jima/* Qwen ONNX).

| Model | Engine | Params | Model size (download) | Languages | Inference | tok/s | RTF | Status |
|---|---|---|---|---|---|---|---|---|
| Moonshine Tiny | sherpa-onnx | 27M | ~125 MB | English | 1,363 ms | 42.55 | 0.05 | ✅ PASS |
| SenseVoice Small | sherpa-onnx | 234M | ~240 MB | zh/en/ja/ko/yue | 1,725 ms | 33.62 | 0.06 | ✅ PASS |
| Whisper Tiny | sherpa-onnx | 39M | ~100 MB | 99 languages | 2,068 ms | 27.08 | 0.07 | ✅ PASS |
| Moonshine Base | sherpa-onnx | 61M | ~290 MB | English | 2,251 ms | 25.77 | 0.08 | ✅ PASS |
| Parakeet TDT 0.6B v3 | sherpa-onnx | 600M | ~671 MB | 25 European languages | 2,841 ms | 20.41 | 0.09 | ✅ PASS |
| Android Speech (Offline) | SpeechRecognizer | System | Built-in | 50+ languages | 3,615 ms | 1.38 | 0.12 | ✅ PASS [2] |
| Android Speech (Online) | SpeechRecognizer | System | Built-in | 100+ languages | 3,591 ms | 1.39 | 0.12 | ✅ PASS [2] |
| Zipformer Streaming | sherpa-onnx streaming | 20M | ~73 MB | English | 3,568 ms | 16.26 | 0.12 | ✅ PASS |
| Whisper Base (.en) | sherpa-onnx | 74M | ~160 MB | English | 3,917 ms | 14.81 | 0.13 | ✅ PASS |
| Whisper Base | sherpa-onnx | 74M | ~160 MB | 99 languages | 4,038 ms | 14.36 | 0.13 | ✅ PASS |
| Whisper Small | sherpa-onnx | 244M | ~490 MB | 99 languages | 12,329 ms | 4.70 | 0.41 | ✅ PASS |
| Qwen3 ASR 0.6B (ONNX) | ONNX Runtime INT8 | 600M | ~1.9 GB | 30 languages | 15,881 ms | 3.65 | 0.53 | ✅ PASS |
| Whisper Turbo | sherpa-onnx | 809M | ~1.0 GB | 99 languages | 17,930 ms | 3.23 | 0.60 | ✅ PASS |
| Whisper Tiny (whisper.cpp) | whisper.cpp GGML | 39M | ~31 MB | 99 languages | 105,596 ms | 0.55 | 3.52 | ✅ PASS |
| Qwen3 ASR 0.6B (CPU) | Pure C/NEON | 600M | ~1.8 GB | 30 languages | 338,261 ms | 0.17 | 11.28 | ✅ PASS [3] |
| Omnilingual 300M | sherpa-onnx | 300M | ~365 MB | 1,600+ languages | 44,035 ms | 0.05 | 1.47 | ❌ FAIL [1] |

Params are approximate parameter counts (e.g. 27M = 27 million parameters, 0.6B = 600 million parameters).

[1] Omnilingual MMS CTC 300M outputs wrong language for English — known model limitation on both iOS and Android. CTC model does not support language conditioning (sherpa-onnx #2812).

[2] On API <33, Android Speech uses acoustic loopback (the WAV is played through the speaker while SpeechRecognizer listens), so the transcript is partial and environment-dependent. On API 33+, direct file input is supported via EXTRA_AUDIO_SOURCE.
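A hedged sketch of the API 33+ direct-file path from footnote [2], assuming the PCM audio is handed to the recognizer through a `ParcelFileDescriptor` pipe (`buildFileRecognitionIntent` is an illustrative name, not this app's API):

```kotlin
import android.content.Intent
import android.media.AudioFormat
import android.os.ParcelFileDescriptor
import android.speech.RecognizerIntent

// API 33+: feed raw PCM to SpeechRecognizer directly instead of acoustic loopback.
fun buildFileRecognitionIntent(pcmPipe: ParcelFileDescriptor): Intent =
    Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
        putExtra(RecognizerIntent.EXTRA_AUDIO_SOURCE, pcmPipe)
        putExtra(RecognizerIntent.EXTRA_AUDIO_SOURCE_SAMPLING_RATE, 16_000)
        putExtra(RecognizerIntent.EXTRA_AUDIO_SOURCE_CHANNEL_COUNT, 1)
        putExtra(RecognizerIntent.EXTRA_AUDIO_SOURCE_ENCODING, AudioFormat.ENCODING_PCM_16BIT)
    }
```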

[3] Qwen3 ASR 0.6B CPU runs on this device but is extremely slow (RTF ~11). Prefer the ONNX INT8 variant on older phones.

15/16 ✅ PASS, 1 ❌ FAIL (known limitation), 0 OOM

Want to see a new model benchmarked? If there is an offline ASR model you would like added or benchmarked on a specific device, please open an issue with the model name and target hardware. Community contributions of benchmark results on different devices are also welcome.

Architecture

  • Orchestrator: service/WhisperEngine.kt + service/TranscriptionCoordinator.kt (inference loop, VAD, chunking)
  • Engines (6 backends):
    • SherpaOnnxEngine — Moonshine, SenseVoice, Parakeet, Whisper, Omnilingual via sherpa-onnx ONNX Runtime
    • SherpaOnnxStreamingEngine — Zipformer via sherpa-onnx ONNX Runtime (100 ms chunks)
    • CactusEngine — Whisper via whisper.cpp (GGML, bridged over JNI)
    • QwenASREngine — Qwen3 ASR via antirez/qwen-asr (Pure C, ARM NEON)
    • QwenOnnxEngine — Qwen3 ASR INT8 via ONNX Runtime (uses ORT from sherpa-onnx)
    • AndroidSpeechEngine — Android SpeechRecognizer (online/offline)
  • Supporting services: E2ETestOrchestrator, StreamingChunkManager, SystemMetrics
  • Audio capture:
    • AudioRecorder (microphone + playback capture)
    • MediaProjectionService (foreground service for capture flow)
  • Translation: MlKitTranslator (Google ML Kit, 20 language pairs)
  • UI: ui/transcription/TranscriptionScreen.kt, ui/setup/ModelSetupScreen.kt
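The six backends plug into a common seam behind the orchestrator. An illustrative sketch of that seam (this is not the repo's actual interface, just the shape such an abstraction typically takes: each engine turns 16 kHz mono PCM into text):

```kotlin
// Hypothetical engine abstraction; names and signatures are illustrative only.
interface TranscriptionEngine {
    suspend fun loadModel(modelDir: String)                 // after in-app download
    suspend fun transcribe(samples: FloatArray,
                           sampleRate: Int = 16_000): String
    fun release()                                           // free native resources
}
```

Under this shape, WhisperEngine/TranscriptionCoordinator would pick an engine from the ModelInfo.kt catalog and feed it VAD-gated audio chunks, so engines stay interchangeable.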

Requirements

  • Android Studio / Android SDK
  • JDK 17
  • Android SDK 35
  • CMake 3.22.1
  • Android 8.0+ (minSdk 26)

Setup

```shell
git clone --recurse-submodules <repo-url>
cd android-offline-transcribe/VoicePingAndroidOfflineTranscribe
./setup-deps.sh
./gradlew assembleDebug
```

Tests

```shell
# Unit tests (172 tests)
cd VoicePingAndroidOfflineTranscribe
./gradlew testDebugUnitTest

# E2E benchmark (all 16 models)
bash /tmp/android-benchmark.sh

# CI and automation
scripts/ci-android-unit-test.sh
scripts/android-e2e-test.sh
scripts/android-userflow-test.sh
```

Privacy & Network Usage

All audio recording and transcription run locally on device. The app makes no analytics, telemetry, or crash-reporting calls. Network is used only for model downloads and one optional cloud mode, listed below:

| Connection | Destination | When | Data Sent |
|---|---|---|---|
| sherpa-onnx model download | huggingface.co/csukuangfj/* | User selects a Moonshine, SenseVoice, Parakeet, Whisper (sherpa), Zipformer, or Omnilingual model | None (HTTPS GET only) |
| whisper.cpp model download | huggingface.co/ggerganov/whisper.cpp | User selects Whisper Tiny (whisper.cpp backend) | None (HTTPS GET only) |
| Qwen3 ASR download (CPU) | huggingface.co/Qwen/Qwen3-ASR-0.6B | User selects Qwen3 ASR CPU | None (HTTPS GET only) |
| Qwen3 ASR download (ONNX) | huggingface.co/jima/qwen3-asr-0.6b-onnx-int8 | User selects Qwen3 ASR ONNX | None (HTTPS GET only) |
| ML Kit translation model | Google servers | User enables translation and selects a language pair | None (model download only, ~30 MB per pair) |
| Android Speech (Online) | Google Cloud Speech | User selects android-speech-online model | Audio sent to Google for recognition |
| Android Speech (Offline) | None | User selects android-speech-offline model | None (fully on-device) |

All model downloads are user-initiated (on model selection), cached locally, and never re-downloaded once present. No user audio, transcription text, or usage data leaves the device — except when the user explicitly selects Android Speech (Online), which sends audio to Google Cloud for recognition.
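The ML Kit row above covers a one-time model fetch; after that, translation is local. A sketch of the flow MlKitTranslator presumably wraps, using the public ML Kit Translation API (the language pair and function name here are illustrative):

```kotlin
import com.google.mlkit.nl.translate.TranslateLanguage
import com.google.mlkit.nl.translate.Translation
import com.google.mlkit.nl.translate.TranslatorOptions

// One-time model download (~30 MB per pair), then fully offline translation.
fun translateOnDevice(text: String, onResult: (String) -> Unit) {
    val options = TranslatorOptions.Builder()
        .setSourceLanguage(TranslateLanguage.ENGLISH)    // example pair; 20 pairs supported
        .setTargetLanguage(TranslateLanguage.JAPANESE)
        .build()
    val translator = Translation.getClient(options)
    translator.downloadModelIfNeeded()                   // the only network call in this path
        .onSuccessTask { translator.translate(text) }    // afterwards runs on device
        .addOnSuccessListener(onResult)
        .addOnCompleteListener { translator.close() }    // release the model handle
}
```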

License

Apache License 2.0. See LICENSE.
