# ClearComms

Fully offline AI that converts noisy radio communication into accurate transcripts and structured incident summaries. Optimized to run locally on Qualcomm AI laptops with no internet connection.
This project improves transcription reliability in emergency and field communication using optimized on-device speech recognition and local language models. It is based on simple-whisper-transcription and extends it with structured extraction via on-device LLaMA (Qualcomm Genie).
- Problem
- Solution
- Key Features
- Tech Stack
- What You Need Installed
- Quick Start: Run Backend & Frontend
- Whisper Models (ONNX)
- Optional: On-Device LLaMA (Genie)
- Optional: TTS (Deepgram)
- Code Organization
- Datasets & References
- Building an Executable
- Contributing
## Problem

First responders and field teams rely on radios that produce noisy, clipped, and hard-to-understand audio. This leads to:
- Misheard instructions
- Missed location or hazard details
- Slower and less effective response
Internet access is often unavailable in these environments, making cloud solutions unreliable.
ClearComms solves this by running transcription and structuring fully on device.
## Solution

ClearComms processes radio audio through three local stages:
Audio is transcribed locally using an optimized Whisper model (e.g. Whisper Base En from Qualcomm AI Hub, or Whisper large-v3-turbo).
Engineering focus includes:
- Running Whisper with ONNX Runtime (QNN/NPU on Qualcomm hardware)
- Model optimization and quantization
- Parameter tuning for noisy radio audio
- Low-latency on-device inference
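As a sketch of the ONNX Runtime setup described above, the snippet below prefers the QNN execution provider (NPU on Snapdragon X Elite) and falls back to CPU. The provider options and model path are illustrative assumptions, not the repo's exact code (see `src/model.py` for the real wrapper):

```python
def preferred_providers(use_qnn: bool = True) -> list:
    """Build an ONNX Runtime provider list that tries QNN first, then CPU."""
    providers = []
    if use_qnn:
        # backend_path selects the Hexagon HTP backend; the DLL name is the
        # usual one on Windows-on-Snapdragon but may differ per QAIRT install.
        providers.append(("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}))
    providers.append("CPUExecutionProvider")
    return providers


def load_encoder(path: str = "models/WhisperEncoder.onnx"):
    """Open a session for the Whisper encoder (path is an example)."""
    import onnxruntime as ort  # imported lazily so the helper above has no deps
    return ort.InferenceSession(path, providers=preferred_providers())
```

If the QNN provider is unavailable at runtime, ONNX Runtime silently continues down the provider list, which is what gives the CPU fallback mentioned below.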
The transcript can be processed by a local LLM (e.g. LLaMA via Qualcomm Genie) to turn raw speech into structured outputs.
Example:

| Raw transcript | Structured output |
|---|---|
| unit 12 need backup at 5th street possible fire | Location: 5th Street<br>Request: Backup<br>Incident: Fire<br>Urgency: High |
This makes communication faster to interpret and act on.
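A minimal sketch of turning an LLM reply into the structured fields shown above. The prompt wording and the JSON-extraction helper are illustrative assumptions; the repo's actual prompts live in `llama_on_device/prompts.py`:

```python
import json
import re

# Illustrative prompt template; not the repo's real prompt.
PROMPT = (
    "Extract Location, Request, Incident, and Urgency from this radio "
    "transcript. Reply with a single JSON object only.\n\nTranscript: {text}"
)


def parse_structured(llm_reply: str) -> dict:
    """Pull the first JSON object out of a model reply, tolerating extra prose."""
    match = re.search(r"\{.*\}", llm_reply, re.DOTALL)
    return json.loads(match.group(0)) if match else {}
```

Asking for JSON and parsing defensively keeps the pipeline robust when a small on-device model wraps its answer in extra text.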
```text
Radio Audio
    ↓
Whisper (ONNX, on device)
    ↓
Transcript
    ↓
Local LLaMA (Genie) [optional]
    ↓
Structured incident / action summary
```
Everything runs fully offline.
## Key Features

- Fully offline operation — No internet required for transcription or LLaMA.
- Optimized Whisper — ONNX Runtime with QNN/NPU on Qualcomm hardware.
- Structured extraction — On-device LLaMA (Genie) for action items and suggested actions.
- Designed for Qualcomm AI hardware — Snapdragon X Elite (e.g. Dell Latitude 7455).
- Fast, reliable transcription in noisy environments.
- Simple UI — React (Vite) frontend, FastAPI backend; upload or record, then see raw transcript + Llama output.
## Tech Stack

- Python (backend, pipeline)
- Whisper (ONNX Runtime, Qualcomm AI Hub / QNN)
- LLaMA (local inference via Qualcomm Genie)
- FastAPI (backend API)
- React + Vite (frontend)
- Qualcomm AI Hub tools, ONNX Runtime
## What You Need Installed

- Python 3.11 (recommended; project uses 3.11.9 on Windows)
- Node.js & npm (for the React frontend)
- FFmpeg — for audio. On Windows: download from FFmpeg Windows builds, extract (e.g. to `C:\Program Files\ffmpeg`), and add the `bin` folder to your PATH.
- Qualcomm AI / QAIRT (for Genie/LLaMA; only if using on-device LLaMA)
- Windows (tested on Windows 11; Snapdragon X Elite)
No GPU or cloud account required for core transcription; NPU/Genie are used when available.
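A quick way to sanity-check the prerequisites from Python (a convenience sketch, not part of the repo):

```python
import shutil
import sys


def check_prereqs() -> dict:
    """Report whether the core prerequisites are visible to this interpreter."""
    return {
        "python_3_11": sys.version_info[:2] == (3, 11),
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "node_on_path": shutil.which("node") is not None,
    }
```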
## Quick Start: Run Backend & Frontend

Clone the repo and set up the Python environment:

```powershell
git clone https://github.com/thatrandomfrenchdude/simple-whisper-transcription.git
cd simple-whisper-transcription
python -m venv whisper-venv
.\whisper-venv\Scripts\Activate.ps1   # Windows
pip install -r requirements.txt
pip install fastapi "uvicorn[standard]" python-multipart
```

Set up the Whisper models:

- Create a `models` folder at the project root.
- Add the ONNX encoder/decoder (e.g. from Qualcomm AI Hub or your own export). Example (AI Hub):

  ```powershell
  python -m qai_hub_models.models.whisper_base_en.export --target-runtime onnx
  ```

  then copy the generated encoder/decoder from `build` into `models`.
- If you use a different variant (e.g. large-v3-turbo), point `config.yaml` at those ONNX files and set `model_variant` accordingly.
Create `config.yaml` in the project root (see Whisper Models (ONNX) for a minimal example). At minimum you need `encoder_path` and `decoder_path` under `models/`.
From the project root, with the venv activated:
```powershell
uvicorn backend.main:app --reload --host 127.0.0.1 --port 8001
```

If port 8001 is in use (e.g. WinError 10013), use another port (e.g. `--port 5000`) and set the same port in `frontend/vite.config.ts` → `proxy["/api"].target`.
In a second terminal:
```powershell
cd frontend
npm install
npm run dev
```

Open http://localhost:5173. The UI proxies `/api` to the backend. You can upload audio (WAV, FLAC, OGG, MP3, M4A) or record from the microphone, then run the pipeline and see the raw transcript and (if enabled) the LLaMA-revised output.
## Whisper Models (ONNX)

- Sample rate: 16 kHz mono (handled by the pipeline).
- Config: in `config.yaml` set at least:
  - `encoder_path`: e.g. `models/WhisperEncoder.onnx`
  - `decoder_path`: e.g. `models/WhisperDecoder.onnx`
  - Optionally `model_variant` (e.g. `base_en` or `large_v3_turbo`) to match your ONNX export.
- Hardware: built and tested on Snapdragon X Elite (e.g. Dell Latitude 7455, 32 GB RAM, Windows 11). ONNX runs with QNN when the models are present; otherwise CPU fallback.
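A minimal `config.yaml` matching the keys above (the file names and variant value are examples; point them at your own export):

```yaml
# Minimal example config — adjust paths to your exported models.
encoder_path: models/WhisperEncoder.onnx
decoder_path: models/WhisperDecoder.onnx
model_variant: base_en   # or large_v3_turbo
```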
## Optional: On-Device LLaMA (Genie)

The app can run Qualcomm Genie (`genie-t2t-run.exe`) to revise or structure the Whisper transcript (e.g. action items, suggested actions). Genie is a native executable; Python only shells out, so no separate Python venv is needed for LLaMA.
- Dot-source the env script so `genie-t2t-run.exe` is on PATH:

  ```powershell
  . .\scripts\setup_genie_env.ps1
  ```

- Set the Genie bundle directory (the folder containing `genie_config.json`):

  ```powershell
  $env:GENIE_BUNDLE_DIR = "C:\path\to\your\llama\bundle"
  ```

- Try it from the command line:

  ```powershell
  python scripts/run_llama_revision.py --text "bravo two copy that we are oscar mike"
  ```

- Before starting the backend, enable the revision step:

  ```powershell
  $env:ENABLE_LLAMA_REVISION = "1"
  uvicorn backend.main:app --reload --host 127.0.0.1 --port 8001
  ```

The frontend will show the raw Whisper transcript immediately and display Loading... in the “Reconstructed transcript (Llama)” box until the revision request returns. Optional env vars: `GENIE_CONFIG`, `GENIE_EXE`, `GENIE_TIMEOUT_S` (see `llama_on_device/README.md`).
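For reference, the shell-out can be sketched as below. The `-c`/`-p` flags follow Qualcomm's published `genie-t2t-run` examples and may differ across QAIRT versions; the repo's actual invocation lives in `llama_on_device/genie_llama.py`:

```python
import os
import subprocess


def build_genie_cmd(bundle_dir: str, prompt: str) -> list:
    """Assemble the Genie CLI invocation (flag names are assumptions)."""
    return [
        "genie-t2t-run.exe",
        "-c", os.path.join(bundle_dir, "genie_config.json"),
        "-p", prompt,
    ]


def revise(text: str, timeout_s: int = 120) -> str:
    """Shell out to Genie and return its stdout (requires a Genie bundle)."""
    cmd = build_genie_cmd(os.environ["GENIE_BUNDLE_DIR"], text)
    out = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    return out.stdout.strip()
```

Keeping the LLM behind a subprocess boundary means a Genie crash or timeout can be caught without taking down the FastAPI backend.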
## Optional: TTS (Deepgram)

The React UI can speak the Whisper transcript (TTS uses the raw transcript, not the LLaMA output). TTS is online (Deepgram); the rest of the pipeline stays offline.
In the same terminal where you start the backend (venv activated):
PowerShell:

```powershell
$env:DEEPGRAM_API_KEY = "dg_..."
$env:DEEPGRAM_TTS_MODEL = "aura-2-arcas-en"
uvicorn backend.main:app --reload --host 127.0.0.1 --port 8001
```

If `DEEPGRAM_API_KEY` is not set, the TTS button is disabled in the UI.
## Code Organization

- Backend: `backend/main.py` — FastAPI app; transcribe endpoint, optional `/api/revise` for LLaMA, TTS endpoints.
- Pipeline: `pipeline/asr.py` (Whisper), `pipeline/enhance.py` (radio DSP), `pipeline/audio_io.py` (load/resample/save).
- Whisper model: `src/model.py` — ONNX + QNN wrapper; supports `base_en` and `large_v3_turbo` (and variants via config).
- Live CLI: `src/LiveTranscriber.py` — live mic → Whisper (no React).
- LLaMA on device: `llama_on_device/` — prompts and Genie subprocess (`genie_llama.py`, `prompts.py`).
- Frontend: `frontend/` — React (Vite); upload/record, transcribe, show raw + reconstructed transcript, latency, export.
## Datasets & References

- LibriSpeech ASR corpus (OpenSLR 12): https://www.openslr.org/12 — large-scale (1000 hours) read English speech at 16 kHz; useful for training or evaluating ASR in clean and “other” conditions.
- Whisper: Qualcomm AI Hub Whisper Base En; OpenAI Whisper large-v3-turbo for larger models.
- Base repo: simple-whisper-transcription.
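For scoring transcripts against LibriSpeech-style references, a small word error rate helper (an illustration, not part of the repo):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref), 1)
```

Running this over a held-out set of noisy clips gives a simple number for comparing Whisper variants or tuning the radio DSP stage.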
## Building an Executable

To build a standalone Whisper transcriber (no React/backend):

1. With the venv activated, run `.\build.ps1` or `python build_executable.py`.
2. Find `WhisperTranscriber.exe` in `dist/`.
3. Copy `config.yaml`, `mel_filters.npz`, and the `models/` folder into `dist/`.
4. Run the executable (see BUILD_EXECUTABLE.md for details).
- Input: Noisy walkie-talkie (or any) audio — upload or record from the mic.
- Output: Raw transcript (Whisper) + optional structured/revised summary (Llama), plus latency and export.
- Environment: All core processing runs locally with no internet.
ClearComms makes critical communication understandable and actionable in the exact environments where reliability matters most: when networks are down, latency is critical, and every word can change the outcome.
## Contributing

Contributions are welcome. Please review CONTRIBUTING.md before submitting a pull request.
This project follows the Contributor Covenant (code of conduct).