WhisperX ASR is a production-ready automatic speech recognition (ASR) app powered by WhisperX and FastAPI. It provides a web UI and REST API for transcribing audio files with word-level timestamps, multi-language support, and GPU acceleration.
For advanced configuration, model options, and setup instructions, see the original WhisperX repository: https://github.com/m-bain/whisperX
- FastAPI backend for robust, scalable API serving
- WhisperX model for accurate speech-to-text transcription
- Web UI for uploading audio, recording, and viewing results
- Supports multiple audio formats: WAV, MP3, M4A, FLAC, OGG, WEBM
- Language selection (auto-detect or manual)
- Batch size control for performance tuning
- Word-level timestamps and segments in results
- GPU acceleration (if available)
- Health check endpoint for monitoring model status
- FastAPI
- WhisperX
- Uvicorn
- Python 3.11.5
- HTML/CSS/JS for the frontend
- Python 3.11.5+
- uvicorn
- uv package manager
- CUDA-enabled GPU for acceleration (optional; CPU works but is slower)
# Clone the repository
git clone https://github.com/romanyn36/whisperx-automatic-speech-recognition.git
cd whisperx-automatic-speech-recognition
# setup virtual environment
uv sync
.\.venv\Scripts\activate # Windows
source .venv/bin/activate # Linux/Mac
# copy the example env file and modify as needed
cp .env.example .env
# Start the server
uv run python main.py --reload
# access the web UI at http://localhost:8000/static/

GET /health
- Returns model status, GPU info, and readiness.
GET /languages
- Lists available transcription languages.
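As a minimal sketch, the two read-only endpoints above can be queried from Python with only the standard library (this assumes the server from the setup steps is running on localhost:8000; the response fields are those described above):

```python
import json
import urllib.request


def get_json(url: str) -> dict:
    """Fetch a URL and parse the JSON response body."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


# With the server running (see setup above):
#   print(get_json("http://localhost:8000/health"))
#   print(get_json("http://localhost:8000/languages"))
```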
POST /transcribe
- Parameters:
  - file: Audio file (WAV, MP3, M4A, FLAC, OGG, WEBM)
  - language: Language code (e.g., en, auto)
  - batch_size: Integer (1–32)
- Returns:
  - transcription: Full text
  - language: Detected language
  - processing_time: Seconds
  - segments: List of segments with timestamps
  - word_segments: List of word-level timestamps
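A stdlib-only client sketch for POST /transcribe follows. The field names (file, language, batch_size) and response keys come from the parameter and return lists above; the base URL and default batch size are assumptions for illustration:

```python
import json
import mimetypes
import urllib.request
import uuid


def encode_multipart(fields: dict, file_field: str, filename: str,
                     file_bytes: bytes) -> tuple[bytes, str]:
    """Build a multipart/form-data body and its Content-Type header value."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [f"--{boundary}",
                  f'Content-Disposition: form-data; name="{name}"',
                  "", str(value)]
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    lines += [f"--{boundary}",
              f'Content-Disposition: form-data; name="{file_field}"; '
              f'filename="{filename}"',
              f"Content-Type: {mime}", ""]
    body = "\r\n".join(lines).encode("utf-8") + b"\r\n" + file_bytes
    body += f"\r\n--{boundary}--\r\n".encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, language: str = "auto", batch_size: int = 8,
               base_url: str = "http://localhost:8000") -> dict:
    """POST an audio file to /transcribe and return the parsed JSON result."""
    with open(path, "rb") as f:
        body, content_type = encode_multipart(
            {"language": language, "batch_size": batch_size},
            "file", path.rsplit("/", 1)[-1], f.read())
    req = urllib.request.Request(f"{base_url}/transcribe", data=body,
                                 headers={"Content-Type": content_type})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# With the server running:
#   result = transcribe("sample.wav", language="en", batch_size=8)
#   print(result["transcription"], result["processing_time"])
```

The manual multipart encoding avoids a dependency on requests; in a real client you may prefer a library that handles this for you.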
Open http://localhost:8000/ in your browser. Features:
- Upload or record audio
- Select language and batch size
- View transcription, segments, and word-level timestamps
You can adjust model size, alignment, and file size limits via environment variables in .env:
- MODEL_SIZE (e.g., large-v2, medium, tiny); on a GTX 1650 GPU, large-v3-turbo performs well
- ENABLE_ALIGNMENT (true/false)
- MAX_FILE_SIZE_MB (default: 100)
- API_KEY (optional, disabled by default)
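For example, a .env tuned for a small consumer GPU might look like this (variable names follow the list above; the values are illustrative):

```env
MODEL_SIZE=large-v3-turbo
ENABLE_ALIGNMENT=true
MAX_FILE_SIZE_MB=100
# API_KEY=change-me   # optional, disabled by default
```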
This project is still in its first development phase, so more features are planned:
- Deploy on GPU (RunPod), currently in progress
- Add production-ready Docker support
- Build GitHub Actions workflows for automatic deployment
- Testing and improvements, including more WhisperX features
This project is licensed under the MIT License. See LICENSE.
